Region Probabilities

In this notebook we provide an introduction to the latest Imandra Tools module, Region_probs, which includes a small probabilistic programming library that allows users to reason about Imandra Modelling Language (IML) models probabilistically. In particular, users can easily create custom hierarchical statistical models defining joint distributions over the inputs to their programs or functions, then sample from these models to obtain a probability distribution over regions, or query them with Boolean conditions. Alternatively, a dataset in the form of a CSV file can be imported and used as a set of samples. In a later notebook we'll illustrate how these tools can be used to analyse and verify the fairness of decision-making algorithms, but here we'll keep things simple and just demonstrate the basic functionality of the module.

Setup

We'll begin by opening the module and installing a pretty printer that will display a decimal version of the real numbers that we'll often be using in this notebook:

In [1]:
open Imandra_tools.Region_probs;;
let pp_approx fmt r = CCFormat.fprintf fmt "%s" (Real.to_string_approx r) [@@program];;
#install_printer pp_approx;;
Out[1]:
val pp_approx : Format.formatter -> real -> unit = <fun>

Note that the module's interface and type signatures can be found in the docs pages.

Basic Distributions

The Region_probs module contains 13 of the most common discrete and continuous univariate probability distributions which can be parameterised and then used directly. In particular, we support the following distributions:

  • Bernoulli
  • Beta
  • Binomial
  • Categorical
  • Cauchy
  • Exponential
  • Gamma
  • Gaussian
  • Laplace
  • Logistic
  • LogNormal
  • Poisson
  • Uniform

Each distribution has named parameters and can take an optional argument ~constraints which truncates the distribution to within the bounds specified by the constraint list. For example, we can define a Gaussian random variable x with mean 5 and standard deviation 1.3, that is restricted to the regions [2.5, 5] and [5.8, 10.3] as:

In [2]:
let x = gaussian ~mu:5. ~sigma:1.3 ~constraints:[(2.5,5.); (5.8,10.3)] [@@program]
Out[2]:
val x : unit -> real = <fun>

We then sample from the distribution to get a particular value by calling x with the unit argument ():

In [3]:
x ();;
x ();;
x ();;
Out[3]:
- : real = 6.24032854584
- : real = 3.67742661647
- : real = 6.16262476123
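One simple way to implement such truncation (a sketch only; the library's actual implementation may differ) is rejection sampling: draw from the untruncated Gaussian and retry until the sample lands inside one of the allowed intervals. In plain OCaml, using the standard Box-Muller transform:

```ocaml
(* Hypothetical sketch of truncation by rejection sampling; not the
   library's actual implementation. *)
let pi = 4.0 *. atan 1.0

(* Box-Muller transform: two uniforms -> one standard normal sample. *)
let std_normal () =
  let u1 = Random.float 1.0 and u2 = Random.float 1.0 in
  sqrt (-2.0 *. log (1.0 -. u1)) *. cos (2.0 *. pi *. u2)

(* A Gaussian sampler truncated to a union of closed intervals. *)
let truncated_gaussian ~mu ~sigma ~constraints () =
  let in_bounds x =
    List.exists (fun (lo, hi) -> lo <= x && x <= hi) constraints
  in
  let rec draw () =
    let x = mu +. sigma *. std_normal () in
    if in_bounds x then x else draw ()
  in
  draw ()

(* Mirrors the example above: every sample lands in [2.5, 5] or [5.8, 10.3]. *)
let x = truncated_gaussian ~mu:5.0 ~sigma:1.3 ~constraints:[(2.5, 5.0); (5.8, 10.3)]
```

Rejection is simple and exact but can be slow when the allowed regions carry little probability mass, which is one reason a library might choose a different scheme internally.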

Building Statistical Models

Using the basic univariate distributions above as building blocks, we can create more complex custom joint distributions over inputs by generating samples sequentially. For example, let's define a distribution over some apples in a shop. We'll consider three kinds of apple, Braeburn, Granny_Smith, and Red_Delicious, which are stocked in varying proportions. We'll also suppose that their masses are normally distributed according to their kind. Let's assume that each apple stays in the shop for a week at a time and that there is a 0.4 chance that any particular apple gets bought on a particular day; then the number of days an apple has been in the shop can be modelled by a Binomial distribution. Finally, let's say that apples tend to ripen quickly after 3 days if they're not already ripe. Thus, a joint distribution over apples can be given as follows:

In [4]:
type kind = Braeburn | Granny_Smith | Red_Delicious

type apple = { kind: kind;
               mass: Q.t;
               days_old: Z.t;
               is_ripe: bool }

let apple_dist () =
  let k = categorical ~classes:[Braeburn; Granny_Smith; Red_Delicious] ~probs:[0.25; 0.6; 0.15] () in
  let mean_mass = function
  | Braeburn -> 85.
  | Granny_Smith -> 80.
  | Red_Delicious -> 100. in
  let mu = mean_mass k in
  let m = gaussian ~mu ~sigma:5. () in
  let d_o = binomial ~n:7 ~p:0.4 () in
  let ripe_chance days = if days >= 3 then 0.9 else 0.4 in
  let p = ripe_chance d_o in
  let i_r = bernoulli ~p () in
  {kind = k; mass = m; days_old = d_o; is_ripe = i_r} [@@program];;

apple_dist ();;
apple_dist ();;
apple_dist ();;
Out[4]:
type kind = Braeburn | Granny_Smith | Red_Delicious
type apple = { kind : kind; mass : real; days_old : Z.t; is_ripe : bool; }
val apple_dist : unit -> apple = <fun>
- : apple =
{kind = Granny_Smith; mass = 87.8541616291; days_old = 2; is_ripe = false}
- : apple =
{kind = Granny_Smith; mass = 76.0184249847; days_old = 5; is_ripe = true}
- : apple =
{kind = Granny_Smith; mass = 83.3649215818; days_old = 2; is_ripe = true}

In general, distributions may be defined over arbitrary groups of types, as long as the final output is a tuple. For example, we could have a distribution over a list of fruits in a bowl and a distribution over the colour of the bowl, but the final output would be a tuple (fruits, bowl_colour) of type fruit list * colour.
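As a plain-OCaml sketch of this idea (with Random.* standing in for the module's samplers, so the type and function names here are purely illustrative), such a tuple-output distribution might look like:

```ocaml
(* Hypothetical sketch: a joint distribution whose final output is a
   tuple (fruit list * colour). Plain-OCaml stand-ins replace the
   module's samplers. *)
type fruit = Apple | Banana | Orange
type colour = Red | Blue | Green

(* Pick uniformly from a non-empty list. *)
let uniform_choice xs = List.nth xs (Random.int (List.length xs))

(* A bowl holds 1-6 fruits; the bowl colour is drawn independently.
   The sampler's final output is the required tuple. *)
let bowl_dist () : fruit list * colour =
  let n = 1 + Random.int 6 in
  let fruits = List.init n (fun _ -> uniform_choice [Apple; Banana; Orange]) in
  let bowl_colour = uniform_choice [Red; Blue; Green] in
  (fruits, bowl_colour)
```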

Estimating Region Probabilities

There are two ways we can estimate region probabilities: the first is to sample from a model using a custom distribution (such as the one above), and the second is to use an existing dataset of samples whose distribution we might not be able to easily write down. Given one of these, we create a Distribution module which can then be used to get region probabilities, estimate Boolean queries, and save/load sample data. To estimate some region probabilities, we'll first need some regions! We'll write a simple function calculating the price of an apple, in pennies, then decompose it:

In [5]:
let price apple =
  let per_gram = function
  | Braeburn -> 0.4
  | Granny_Smith -> 0.35
  | Red_Delicious -> 0.3
  in let p = (per_gram apple.kind) *. apple.mass in
  if apple.days_old >= 5 then p /. 2. else p;;

let d = Modular_decomp.top ~prune:true "price" [@@program];;

let regions = Modular_decomp.get_concrete_regions d [@@program];;
Out[5]:
val price : apple -> real = <fun>
val d : Modular_decomp_intf.decomp_ref = <abstr>
val regions : Top_result.modular_region list =
  [{Constraints:
      [apple.kind = Braeburn && apple.days_old >= 5]
    Invariant:
      (2/5 *. apple.mass) /. 2};
   {Constraints:
      [apple.kind = Granny_Smith &&
       not (apple.kind = Braeburn) &&
       apple.days_old >= 5]
    Invariant:
      (7/20 *. apple.mass) /. 2};
   {Constraints:
      [not (apple.kind = Granny_Smith) &&
       not (apple.kind = Braeburn) &&
       apple.days_old >= 5]
    Invariant:
      (3/10 *. apple.mass) /. 2};
   {Constraints:
      [apple.kind = Braeburn && not (apple.days_old >= 5)]
    Invariant:
      2/5 *. apple.mass};
   {Constraints:
      [apple.kind = Granny_Smith &&
       not (apple.kind = Braeburn) &&
       not (apple.days_old >= 5)]
    Invariant:
      7/20 *. apple.mass};
   {Constraints:
      [not (apple.kind = Granny_Smith) &&
       not (apple.kind = Braeburn) &&
       not (apple.days_old >= 5)]
    Invariant:
      3/10 *. apple.mass}]

Now we're in a position to calculate the probability mass of each region given our distribution above, or a separate dataset.
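Conceptually, these probabilities are Monte Carlo estimates: draw many samples from the input distribution, test each region's constraints, and report the empirical frequencies. A simplified plain-OCaml sketch of this idea (with hypothetical stand-in samplers over kind and days_old only, since the mass doesn't affect region membership):

```ocaml
(* Hypothetical sketch of Monte Carlo region-probability estimation.
   Simplified stand-ins replace the module's samplers and its region
   representation. *)
type kind = Braeburn | Granny_Smith | Red_Delicious

(* Categorical over kinds with probs 0.25 / 0.6 / 0.15. *)
let sample_kind () =
  let u = Random.float 1.0 in
  if u < 0.25 then Braeburn
  else if u < 0.85 then Granny_Smith
  else Red_Delicious

(* Binomial(7, 0.4) as seven Bernoulli trials. *)
let sample_days_old () =
  let trial () = if Random.float 1.0 < 0.4 then 1 else 0 in
  List.fold_left (+) 0 (List.init 7 (fun _ -> trial ()))

(* The six regions' constraints as predicates over (kind, days_old). *)
let regions =
  [ (fun (k, d) -> k = Braeburn && d >= 5);
    (fun (k, d) -> k = Granny_Smith && d >= 5);
    (fun (k, d) -> k = Red_Delicious && d >= 5);
    (fun (k, d) -> k = Braeburn && d < 5);
    (fun (k, d) -> k = Granny_Smith && d < 5);
    (fun (k, d) -> k = Red_Delicious && d < 5) ]

(* Count how often each region's constraints hold over n samples. *)
let estimate_probs ~n =
  let counts = Array.make (List.length regions) 0 in
  for _ = 1 to n do
    let s = (sample_kind (), sample_days_old ()) in
    List.iteri (fun i in_region ->
      if in_region s then counts.(i) <- counts.(i) + 1) regions
  done;
  Array.map (fun c -> float_of_int c /. float_of_int n) counts
```

Because the regions partition the input space, each sample falls in exactly one region and the estimated probabilities sum to 1.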

Sampling Over Regions

Let's first use our apple_dist distribution from above to create a Distribution module:

In [6]:
module Apple_S = Distribution.From_Sampler (struct type domain = apple let dist = apple_dist end) [@@program]
Out[6]:
module Apple_S :
  sig
    type domain = apple
    val get_probs :
      Top_result.modular_region list ->
      ?assuming:string ->
      ?n:Z.t ->
      ?d:string ->
      ?step_function:(domain -> domain) ->
      ?num_steps:Z.t ->
      unit -> (int, float * Top_result.modular_region) Hashtbl.t
    val query :
      (domain -> bool) ->
      ?n:Z.t ->
      ?d:string -> ?max_std_err:real -> ?precision:Z.t -> unit -> unit
    val save : string -> ?n:Z.t -> ?d:domain list -> unit -> unit
    val load : string -> domain list
  end

Now we can find the probabilities using the get_probs function, which takes the regions and the following optional arguments:

  • n: The number of samples to be generated for calculating probabilities (only applies when using From_Sampler, default value is 10000)
  • d: The file name of the dataset of samples to be imported (only applies when using From_Data)
  • step_function: If the distribution is over the initial value of some type x, and the function val f: x -> x that we are decomposing acts as a step in some transition system for x, then this step function can be applied to x multiple times before the region probabilities are calculated
  • num_steps: The number of times to apply the step function, if given
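The effect of step_function and num_steps is simply iterated application before each sample is classified into a region; a minimal sketch:

```ocaml
(* Sketch of what ~step_function / ~num_steps amount to: each sampled
   input is advanced num_steps times through the transition function
   before region membership is tested. *)
let rec iterate f n x = if n <= 0 then x else iterate f (n - 1) (f x)

(* Hypothetical example: a counter system stepped 3 times. *)
let step c = c + 1
let advanced = iterate step 3 10   (* 13 *)
```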
In [7]:
let apple_probs_S = Apple_S.get_probs regions () [@@program]
Out[7]:
val g___1 : apple -> int = <fun>
- : unit = ()
val apple_probs_S : (int, float * Top_result.modular_region) Hashtbl.t =
  <abstr>

Probabilities can be printed using the print_probs function, which takes the hashtable of region probabilities outputted by get_probs and the following optional arguments:

  • precision: The number of decimal places that probabilities are displayed to (default is 4)
  • full_support: Whether or not to list regions that have probability 0 (default is false)
  • verbose: Whether or not to print a description of each region, including its constraints and invariant, alongside the probabilities (default is false)
In [8]:
print_probs apple_probs_S ~verbose:true
Out[8]:
Region Probabilities
---[region]---
Index: 2
Probability: 0.0134
Constraints:
  [ apple.days_old >= 5
  ; apple.kind = Red_Delicious ]
Invariant:
  F = (apple.mass * 3/10) / 2
 -------------
---[region]---
Index: 3
Probability: 0.2234
Constraints:
  [ apple.days_old < 5
  ; apple.kind = Braeburn ]
Invariant:
  F = apple.mass * 2/5
 -------------
---[region]---
Index: 5
Probability: 0.1360
Constraints:
  [ apple.days_old < 5
  ; apple.kind = Red_Delicious ]
Invariant:
  F = apple.mass * 3/10
 -------------
---[region]---
Index: 4
Probability: 0.5436
Constraints:
  [ apple.days_old < 5
  ; apple.kind = Granny_Smith ]
Invariant:
  F = apple.mass * 7/20
 -------------
---[region]---
Index: 0
Probability: 0.0244
Constraints:
  [ apple.days_old >= 5
  ; apple.kind = Braeburn ]
Invariant:
  F = (apple.mass * 2/5) / 2
 -------------
---[region]---
Index: 1
Probability: 0.0592
Constraints:
  [ apple.days_old >= 5
  ; apple.kind = Granny_Smith ]
Invariant:
  F = (apple.mass * 7/20) / 2
 -------------

- : unit = ()

Datasets Over Regions

Using a dataset to estimate probabilities is very similar to the procedure above, although as a CSV stores strings rather than values of our types, we need to include a pair of functions for converting each row (a list of strings) to and from our data type:

In [9]:
let to_apple apple_string =
  let k = match CCList.nth apple_string 0i with
  | "bb" -> Braeburn
  | "gs" -> Granny_Smith
  | "rd" -> Red_Delicious
  | _ -> failwith "Apple type is not bb, gs, or rd" in
  let m = Q.of_string (CCList.nth apple_string 1i) in
  let d_o = Z.of_string (CCList.nth apple_string 2i) in
  let i_r = if CCList.nth apple_string 3i = "1" then true else false in
  {kind = k; mass = m; days_old = d_o; is_ripe = i_r} [@@program];;

let from_apple apple =
  let k = match apple.kind with
  | Braeburn -> "bb"
  | Granny_Smith -> "gs"
  | Red_Delicious -> "rd" in
  let m = Q.to_string apple.mass in
  let d_o = Z.to_string apple.days_old in
  let i_r = if apple.is_ripe then "1" else "0" in
  [k; m; d_o; i_r] [@@program];;
Out[9]:
val to_apple : string list -> apple = <fun>
val from_apple : apple -> string list = <fun>

With these two functions (and our apple type) we can define a model based on some data in apple.csv and then use it just as we did with our previous model. Note that data models load and save data as CSV files, whereas sampling models use byte files. The final line is non-executable in this notebook, as we don't actually have access to the local CSV file:

In [10]:
module Apple_D = Distribution.From_Data (struct
                                         type domain = apple
                                         let from_domain = from_apple
                                         let to_domain = to_apple
                                         end) [@@program];;
Out[10]:
module Apple_D :
  sig
    type domain = apple
    val get_probs :
      Top_result.modular_region list ->
      ?assuming:string ->
      ?n:Z.t ->
      ?d:string ->
      ?step_function:(domain -> domain) ->
      ?num_steps:Z.t ->
      unit -> (int, float * Top_result.modular_region) Hashtbl.t
    val query :
      (domain -> bool) ->
      ?n:Z.t ->
      ?d:string -> ?max_std_err:real -> ?precision:Z.t -> unit -> unit
    val save : string -> ?n:Z.t -> ?d:domain list -> unit -> unit
    val load : string -> domain list
  end
let apple_probs_D = Apple_D.get_probs regions ~d:"apple.csv" () [@@program];;

Estimating Query Probabilities

The last feature we'll introduce in this notebook is the query function which, given a statistical model, estimates how likely it is for a particular Boolean query to hold over an instance from that distribution. query takes in a Boolean function over the domain (representing some condition) and the following optional arguments:

  • n: The number of samples to be generated for calculating probabilities (only applies when using From_Sampler, default value is 10000)
  • d: The file name of the dataset of samples to be imported (only applies when using From_Data)
  • max_std_err: If estimating the probability by generating samples, we can improve our estimate to within this optional bound on the standard error by taking more samples
  • precision: The number of decimal places that probabilities are displayed to (default is 4)
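For intuition about max_std_err: the standard error of an estimated probability p after n samples is sqrt (p * (1 - p) / n), so a sampler can keep drawing batches until the bound is met. A plain-OCaml sketch of this stopping rule (not the library's internals):

```ocaml
(* Standard error of a proportion p estimated from n samples. *)
let std_err p n = sqrt (p *. (1.0 -. p) /. float_of_int n)

(* Hypothetical sketch: draw batches of samples from a Boolean "query"
   until the standard error of the estimate drops below the bound.
   Returns the estimate and the number of samples used. *)
let estimate_until ~max_std_err ~sample =
  let batch = 1000 in
  let hits = ref 0 and n = ref 0 in
  let rec loop () =
    for _ = 1 to batch do
      incr n;
      if sample () then incr hits
    done;
    let p = float_of_int !hits /. float_of_int !n in
    if std_err p !n > max_std_err then loop () else (p, !n)
  in
  loop ()
```

For example, with a query that holds with probability about 0.12 (roughly the rip_off probability below), reaching a standard error of 0.002 needs on the order of 0.12 * 0.88 / 0.002^2, i.e. around 26,000 samples.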

Let's begin by writing a simple query, then calculate its probability:

In [11]:
let rip_off (a : apple) = (price a) >=. 30. && not a.is_ripe
Out[11]:
val rip_off : apple -> bool = <fun>

Sampling Queries

In [12]:
Apple_S.query rip_off ~n:30000 ()
Out[12]:
Probability: 0.1196 ± 0.0019
- : unit = ()

Dataset Queries

Note once again that as we don't have access to apple.csv from this notebook, the following code is non-executable here:

Apple_D.query rip_off ~d:"apple.csv" ()

Optional: Changing the Pseudo-Random Number Generator

Sampling in the Region_probs module relies on access to a Pseudo-Random Number Generator (PRNG) that can produce uniform samples from the range [0,1). Presently we support two PRNGs, which can be changed according to the user's preferences, although in practice this should make little difference. The different PRNGs are:

  • RS: OCaml's standard PRNG using module Random.State
  • GSL: GNU Scientific Library's Mersenne Twister PRNG using module Gsl.Rng

RS is the default option. The PRNG can be checked and changed using the get_prng and set_prng functions as follows:

In [13]:
get_prng ();;
set_prng GSL
Out[13]:
RS: OCaml's standard PRNG using module Random.State
- : unit = ()
- : unit = ()