Analysing Machine Learning Models With Imandra¶
In this notebook we show how Imandra can be used to analyse and reason about models that have been learnt from data, an important and exciting topic bridging the gap between formal methods and machine learning (ML). Brief notes and some links are included, but for a fuller explanation (with code snippets included for reference) see our corresponding Medium post. You can also find all of our code for both learning and analysing our models on GitHub.
To illustrate this approach we'll be looking at (relatively simple) examples from two of the most common tasks within supervised learning (and ML more generally): classification and regression. In particular, we'll show how two of the most common kinds of model used to perform these tasks, random forests and neural networks, can be analysed using Imandra. For each task we'll use a real-world benchmark dataset from the UCI Machine Learning Repository and create our models using Python with some standard ML libraries.
We'll mostly be working with reals in this notebook so we'll start by installing a pretty printer so that we're not overrun with digits.
let pp_approx fmt r = CCFormat.fprintf fmt "%s" (Real.to_string_approx r) [@@program]
#install_printer pp_approx
Classification¶
In a classification task we want to learn to predict the label of a datapoint based on previous data. In the classic Wisconsin Breast Cancer (Diagnostic) dataset the task is to predict whether a cancer is benign or malignant based on the features of cell nuclei. In the dataset we have the following variables:
1. ID number
2. Diagnosis (malignant or benign)
3-32. Real values for the mean, standard error, and 'worst' value of each of the following cell nucleus features:
a) Radius
b) Texture
c) Perimeter
d) Area
e) Smoothness
f) Compactness
g) Concavity
h) Concave points
i) Symmetry
j) Fractal dimension
As is standard practice, we pre-process the data before learning. First we standardise each variable to have zero mean and unit variance, then keep only one variable from each set of highly correlated variables and drop those with low mutual information with respect to the target variable. The data is split into training (80%) and test (20%) sets, and we use Scikit-Learn to learn a random forest of 3 decision trees with maximum depth 3. As this is a relatively straightforward problem, even this simple model achieves fairly high accuracy. Using a short Python script, each tree is then converted to the Imandra Modelling Language (IML) so that it can be reasoned about using Imandra.
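(* Each tree below returns a pair of leaf counts learnt from the training data:
   (benign votes, malignant votes). The ensemble sums these pairs across trees
   and the larger total determines the final classification. *)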
let tree_0 (f_0 : real) (f_1 : real) (f_2 : real) (f_3 : real) (f_4 : real) (f_5 : real) (f_6 : real) = let open Real in
if f_2 <=. (-0.10815) then
if f_0 <=. (0.26348) then
if f_6 <=. (-0.06176) then
(236.0, 1.0)
else
(17.0, 5.0)
else
if f_3 <=. (-0.54236) then
(8.0, 2.0)
else
(3.0, 7.0)
else
if f_6 <=. (0.09812) then
if f_6 <=. (-0.17063) then
(24.0, 0.0)
else
(4.0, 2.0)
else
if f_2 <=. (2.65413) then
(6.0, 128.0)
else
(7.0, 5.0);;
let tree_1 (f_0 : real) (f_1 : real) (f_2 : real) (f_3 : real) (f_4 : real) (f_5 : real) (f_6 : real) = let open Real in
if f_5 <=. (-0.05799) then
if f_0 <=. (0.68524) then
if f_1 <=. (-0.83180) then
(110.0, 3.0)
else
(137.0, 0.0)
else
if f_3 <=. (0.45504) then
(1.0, 8.0)
else
(0.0, 7.0)
else
if f_0 <=. (-0.18668) then
if f_6 <=. (0.45214) then
(39.0, 0.0)
else
(2.0, 11.0)
else
if f_6 <=. (-0.00009) then
(8.0, 4.0)
else
(5.0, 120.0);;
let tree_2 (f_0 : real) (f_1 : real) (f_2 : real) (f_3 : real) (f_4 : real) (f_5 : real) (f_6 : real) = let open Real in
if f_2 <=. (0.10459) then
if f_5 <=. (-0.38015) then
if f_5 <=. (-0.60659) then
(139.0, 1.0)
else
(44.0, 3.0)
else
if f_6 <=. (-0.07927) then
(38.0, 2.0)
else
(25.0, 17.0)
else
if f_6 <=. (0.46888) then
if f_3 <=. (0.41642) then
(28.0, 3.0)
else
(1.0, 4.0)
else
if f_2 <=. (1.74327) then
(3.0, 122.0)
else
(4.0, 21.0);;
let rf (f_0, f_1, f_2, f_3, f_4, f_5, f_6) = let open Real in
let (a_0, b_0) = tree_0 f_0 f_1 f_2 f_3 f_4 f_5 f_6 in
let (a_1, b_1) = tree_1 f_0 f_1 f_2 f_3 f_4 f_5 f_6 in
let (a_2, b_2) = tree_2 f_0 f_1 f_2 f_3 f_4 f_5 f_6 in
let a = a_0 + a_1 + a_2 in
let b = b_0 + b_1 + b_2 in
(a, b);;
We can create a custom input type in Imandra for our model, so that we can keep track of the different features of our data.
type rf_input = {
radius_mean : real;
compactness_mean : real;
concavity_mean : real;
radius_se : real;
compactness_worst : real;
concavity_worst : real;
concave_points_worst : real;
}
However, remember that we also processed our data before learning. To make things easier we'll add a function that applies this transformation to each input variable, using the per-variable means and standard deviations recorded during our data pre-processing stage. After that we can define a full model which combines these pre-processing steps with the random forest.
let process_rf_input input = let open Real in
let f_0 = (input.radius_mean - 14.12729) / 3.52405 in
let f_1 = (input.compactness_mean - 0.10434) / 0.05281 in
let f_2 = (input.concavity_mean - 0.08880) / 0.07972 in
let f_3 = (input.radius_se - 0.40517) / 0.27731 in
let f_4 = (input.compactness_worst - 0.25427) / 0.15734 in
let f_5 = (input.concavity_worst - 0.27219) / 0.20862 in
let f_6 = (input.concave_points_worst - 0.11461) / 0.06573 in
(f_0, f_1, f_2, f_3, f_4, f_5, f_6)
let process_rf_output c =
let (a, b) = c in
if a >. b then "benign" else "malignant"
let rf_model input = input |> process_rf_input |> rf |> process_rf_output
As our model is fully executable we can query it, find counterexamples, prove properties, apply logical side-conditions, decompose its regions, and more. As a quick sanity check to make sure everything is working, let's run a datum from our dataset through the model. In particular, we'll input (17.99, 0.2776, 0.3001, 1.095, 0.6656, 0.7119, 0.2654), which is classified as malignant in the data.
let x = {
radius_mean = 17.99;
compactness_mean = 0.2776;
concavity_mean = 0.3001;
radius_se = 1.095;
compactness_worst = 0.6656;
concavity_worst = 0.7119;
concave_points_worst = 0.2654;
}
let y = rf_model x
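(* This evaluates to "malignant", matching the label in the dataset. *)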
Great, just what we'd expect. Now we'll use Imandra to generate an example datapoint for us, given that the diagnosis is benign.
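A minimal form of this query uses the same instance mechanism as the constrained search further below:

instance (fun x -> rf_model x = "benign")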
This looks a bit odd, however: notice how the unspecified input variables take values that don't make sense with respect to the data. In general we might only care about the performance of our model when some reasonable bounds are placed on the input (for example, the mean radius can't be negative, and if the values for this variable in our dataset range between 6.98 and 28.11 we wouldn't really expect any value greater than, say, 35). Using the description of each variable in the dataset we can form a condition describing valid and reasonable inputs to our model. In machine learning more generally, we are typically only interested in the performance and quality of a model over some particular distribution of data, about which we often have prior beliefs.
let is_valid_rf input =
5.0 <=. input.radius_mean && input.radius_mean <=. 35.0 &&
0.0 <=. input.compactness_mean && input.compactness_mean <=. 0.4 &&
0.0 <=. input.concavity_mean && input.concavity_mean <=. 0.5 &&
0.0 <=. input.radius_se && input.radius_se <=. 3.5 &&
0.0 <=. input.compactness_worst && input.compactness_worst <=. 1.2 &&
0.0 <=. input.concavity_worst && input.concavity_worst <=. 1.5 &&
0.0 <=. input.concave_points_worst && input.concave_points_worst <=. 0.35
instance (fun x -> rf_model x = "benign" && is_valid_rf x)
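When an instance is found, Imandra reflects the witness into a module called CX, so (assuming that reflection is available in the current session) we can sanity-check the result by running it back through the model:

rf_model CX.x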
This looks much better. Now let's move on to reasoning about our model in more interesting ways. One thing we can do is check the validity of certain constraints we might want our model to satisfy. For example, if the surface of a cell nucleus has many large concave sections, that is a particularly negative sign, indicating that the cancer is likely to be malignant. We can use Imandra to verify that our model always classifies a sample of highly concave cells as malignant.
verify (fun x -> is_valid_rf x
&& x.concavity_mean >=. 0.4
&& x.concavity_worst >=. 1.0
&& x.concave_points_worst >=. 0.25
==> rf_model x = "malignant")
The nested if ... then ... else statements defining the trees make them a prime candidate for Imandra's region decomposition functionality. As well as the full model, we can of course also decompose the individual trees making up the ensemble.
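As a sketch (assuming Imandra's Decompose.top entry point; the exact invocation may vary between Imandra versions), decomposing the first tree looks like this:

Decompose.top "tree_0"

Each region that comes back pairs a set of constraints on the input features with the vote counts the tree returns whenever those constraints are satisfied.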