Thinking Formally

Writing formal models requires a slightly different approach to thinking about software. Just as developers strive to write “clear code”, there are certain techniques for writing “clear” models so that they are easier for the machine to reason about and for you to formulate questions about.

Perhaps the first important point to discuss is that IML is based on the pure subset of OCaml (www.ocaml.org), which means that it DOES NOT allow code with side-effects. In plain English, this means that if you write a function, nothing in it can influence values declared outside of it. All of the changes made to the global state of the program should be included in the return value of the function. This is not a fundamental limitation - you can convert programs with side-effects into formal state machine models where you “carry the state” in your code. We’ll discuss this in detail later. We’ve encoded incredibly complex systems this way, and the underlying techniques go way back to the seminal work by Boyer and Moore.
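As a minimal sketch of what “carrying the state” means (the type and function names below are hypothetical, purely for illustration), an operation that would normally mutate a global counter instead takes the current state as an argument and returns the updated state as part of its result:

(* A hypothetical counter in the pure style IML expects: instead of
   mutating a global variable, the current state is passed in and the
   updated state is returned alongside the result. *)
type counter_state = { count : int }

let increment (s : counter_state) : int * counter_state =
  let s' = { count = s.count + 1 } in
  (s'.count, s')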

Here are some other key elements of writing models in IML that are important to cover:

  • Static types - IML is a statically typed functional programming language, which means that datatypes are determined before a program is run and cannot be changed later. So things like assigning a variable to None, then a string, and finally an int - which you might see in dynamically typed languages like Python - are a “no go” in IML. If the types in the model are mismatched - for example, if you attempt to add an integer to a string - ImandraX will return an error. In a sense, this is also a form of logical reasoning. Languages like Python, while dynamic, also have tools to help infer type information - CodeLogician uses them to extract type information from the source program when it synthesizes IML models.
  • Recursion - besides encoding recursive functions, recursion is used to model loops. Both while and for loops are rewritten as recursive function calls (see the sketch after this list). Reasoning about these control structures ultimately requires reasoning by induction, and rewriting them as recursive function calls makes the inductive step easier to identify.
  • State machines - these are typically infinite state machines because the state type may contain complex datatypes (e.g. lists of data structures) in addition to the regular arithmetic types. State machines are a useful mechanism for modeling code with side-effects (where the variable outside the scope becomes part of the state) and, more generally, complex systems like stock exchanges. For example, you would model the stream of incoming order messages as a list that the top-level step function iterates through (sketched below). Our Demo Gallery page and publications section contain many different examples.
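To make the last two points concrete, here is a minimal sketch (the function names are illustrative) of a Python-style for loop that sums the numbers below n, rewritten as a recursive IML function:

(* total = 0
   for i in range(n): total += i
   ...becomes a recursive function that decreases on n *)
let rec sum_below (n : int) : int =
  if n <= 0 then 0
  else (n - 1) + sum_below (n - 1)

And here is a hedged sketch of the state machine pattern: the state record carries everything a side-effecting program would keep in mutable variables, and a top-level function folds the step function over the list of incoming messages (the order and state types below are made up for illustration - this is not a real exchange model):

type order = { id : int; qty : int }

type state = { book : order list; processed : int }

(* Process a single incoming message, returning the updated state *)
let step (s : state) (o : order) : state =
  { book = o :: s.book; processed = s.processed + 1 }

(* Iterate the step function over the stream of incoming messages *)
let rec run (s : state) (msgs : order list) : state =
  match msgs with
  | [] -> s
  | m :: rest -> run (step s m) rest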

What should we model (let’s talk about abstraction)?

CodeLogician is an agent that uses ImandraX to build and reason about formal models, so the question “What type of source code should I apply CodeLogician to?” should instead be reformulated as “What type of code could/should I model in ImandraX, and how?” While any type of executable program (even those that encode hardware designs) could be modelled in ImandraX, how you go about it depends on what your objectives are.

This is where the concept of “abstraction” comes in. Let’s say, as an example, we have some code in C that sorts a list of integers and involves the typical memory operations (e.g. memory (de)allocation). “Reasoning” about the program may involve making sure there is no referencing of deallocated memory, or that the resulting list is, in fact, sorted. In ImandraX you can do both:

  • Reason about memory operations - this would require a faithful model (in ImandraX) of the memory manager and the semantics of the relevant operations
  • Reason about whether the list that’s returned is sorted - this would not require a model of the memory manager

The second point above is a “higher” level of abstraction than the first (which involves memory procedures). The features you want to reason about dictate the level of abstraction you will focus on and, ultimately, how you will go about creating the formal model (a sketch of the “sorted” property appears after the three categories below). Perhaps a useful way to organize these levels is through three categories:

  • High-level - this category typically involves DSLs (Domain Specific Languages) that are used to represent complex processes, systems, etc. In this case, a program in one of these languages is translated into IML with a high level of compression. For example, our formal DSL Imandra Protocol Language (IPL), which we use to model symbolic state transition systems (e.g. a complex trading API), has a 1:10 compression ratio with respect to IML - 1 line of IPL, on average, results in 10 lines of IML. For this category, we don’t recommend using CodeLogician, but rather developing a DSL that ultimately translates into ImandraX. For IPL, we use the Langium framework by TypeFox (and its predecessor Xtext/Xtend).

Note: We’re currently working on another agent, SpecLogician, which (inspired by Cucumber!) encodes user stories expressed in English as Given/When/Then sequences into a formal representation in IPL for subsequent analysis of state reachability and much more.

The reasoning at this level deals with non-trivial state transitions, validation of data inputs/outputs represented by rich structures (conditioned on the current state), etc.

  • Mid-level - application-level software is another way to think of this category - it’s most of the software that you interact with on a day-to-day basis. This is the “sweet spot” application type for CodeLogician - for these programs, you can encode source programs into models directly, without having to go via a DSL.

  • Low-level - this is also a great application of ImandraX, but it is outside the scope of CodeLogician. Typically, analyzing such code in ImandraX requires creating a virtual machine representing the memory manager, the CPU (to give meaning to the instructions), etc. For example, at Imandra we created the first formal model of the Ethereum Virtual Machine - this required not only defining the instructions, but implementing an actual executable model of the virtual machine so we could reason about the execution of those instructions. If you were to simply ask CodeLogician to analyze a set of instructions at such a low level, it would not have access to the virtual machine, which is what’s required to understand the semantics (aka meaning) of those instructions.
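Coming back to the sorting example: at the higher level of abstraction, the property we care about is stated directly over the returned list, with no mention of memory at all. A minimal sketch in IML might look like the following (the sort and is_sorted functions are illustrative stand-ins for the model of the C routine, and proving the goal may require auxiliary lemmas and induction hints):

(* Insertion sort over integer lists - a stand-in for the model of the C routine *)
let rec insert (x : int) (l : int list) : int list =
  match l with
  | [] -> [x]
  | y :: rest -> if x <= y then x :: l else y :: insert x rest

let rec sort (l : int list) : int list =
  match l with
  | [] -> []
  | x :: rest -> insert x (sort rest)

(* The property at this level of abstraction: the output is sorted *)
let rec is_sorted (l : int list) : bool =
  match l with
  | [] | [_] -> true
  | x :: (y :: _ as rest) -> x <= y && is_sorted rest

verify (fun l -> is_sorted (sort l));; (* may require induction hints and auxiliary lemmas *)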

Dealing with the unknown

Very rarely will you have code that doesn’t make use of third-party APIs or libraries. When you run code that relies on those services and libraries, you’re making assumptions that they work in a certain way. In the example below, we use a pseudorandom number generator - very few developers that use such code have actually looked at its implementation and proved to themselves that it behaves in a certain way. Without making such assumptions, modern software development would simply not scale, as implementing everything from scratch is simply infeasible.

When developers run tests that make use of third-party code and services and want reproducible results, one technique often used is mocking - developers create tiny approximations of those outside artifacts so they can be used during the tests.

When we formalize source code into logic via IML, we use very similar techniques - we need to explicitly state our assumptions and, when we want to be able to execute the models, create approximations of those unknown functions.

One difference is that, in addition to handling outside libraries and source code, we also need to deal with things that are difficult to reason about from a mathematical perspective.

Opaque functions

Opaque functions within ImandraX allow us to introduce functions into our model whose implementations we know nothing about except for their signatures (input and output types). By introducing a function with that decorator ([@@opaque]), we can use it in other code so it ‘type-checks’, and we can even prove properties about the code that references it. But if we prove properties about code that calls opaque functions, that reasoning will not assume anything about the logic within those functions, UNLESS we introduce axioms, which we’ll describe in more detail later.

Let’s consider a simple fragment like this:

import random

def hello():
  # Branch on a pseudorandom draw in [0, 1)
  a = random.random()
  if a > 0.4:
    return 1
  elif a >= 0.0:
    return 2
  else:
    return 3

We have a good idea of what the function random.random() does - it should generate a pseudorandom number in the range [0, 1]. That is what we “assume”, since we typically would not spend the time tracing the library we’re using down to its underlying implementation.

The default formalization pass by CodeLogician will recognize that random.random comes from a library and will introduce it as an opaque function. The result will look like this:

let random: real -> real [@@opaque];;

let hello(x) =
  let a = random(x) in 
  if a >. 0.4 then 1
  else if a >=. 0.0 then 2
  else 3;;

You can think of the input x to random as a “seed” value. The status of this formalization will state that opaqueness is present - while code that relies on opaque functions is not executable (so we can’t generate test cases for it), we can begin to verify properties about it. Since we have only declared that the opaque function random exists, all of our proofs will simply assume that it returns some real value.

Adding assumptions

OK, so we can declare what the types of opaque functions are, but can we also add additional information? For example, while we still haven’t encoded the full logic of the pseudorandom number generator in the random library, we know that the return value should be in the range [0..1].

Let’s revisit the simple model above, this time declaring random as a function from unit to real:

let random () : real = 0.0 [@@opaque];; (* dummy body; ignored when reasoning *)

let hello(x) =
  let a = random(x) in 
  if a >. 0.4 then 1
  else if a >=. 0.0 then 2
  else 3;;

Let’s first try to verify that hello can only return 1, 2 or 3. We’ll do this with a verification goal (VG) like this:

verify (fun x -> let t = hello(x) in (t = 1 || t = 2 || t = 3));; (* Proved! *)

Let’s say that now we want to add an assumption that our opaque function random will, in fact, return a real value within the [0, 1] range - we’ll call this axiom bound1. We’ll also introduce another axiom, called bound2, which further limits the possible output of the random function, and see how it affects what we can prove about hello.

(* This is an opaque function - while we provide the "body" of the function, it is disregarded when we ask ImandraX to reason about it *)
let random () : real = 0.0 [@@opaque];;

(* "axiom"s represent our assumptions that we encode and use when reasoning about "opaque" functions *)
axiom bound1 x = let a = random(x) in (a >=. 0.0 && a <=. 1.0);;
axiom bound2 x = let a = random(x) in (a >=. 0.0 && a <=. 0.3);;

(* Let's define a simple function that, depending on 'random', returns one of three possible values *)
let hello () = 
  let a = random() in 
  if a >. 0.4 then 1
  else if a >=. 0.0 then 2
  else 3;;

(* Notice no axioms used! *)
verify (fun x -> let t = hello(x) in (t = 1 || t = 2 || t = 3));; (* Proved! *)

(* Now we introduce the first axiom `bound1` - so `random` is now assumed to return a `real` value in the range [0..1] *)
verify (fun x -> let t = hello(x) in (t = 1 || t = 2)) (* Since we limited the possible values of `random`, the output of "hello" is now only 1 or 2 - 3 is unreachable! *)
[@@by [%use bound1 x] @> auto];;

(* With the second axiom, we limit the output of the "random" function further - it is now within [0..0.3] *)
verify (fun x -> let t = hello(x) in (t = 2)) (* Because we applied `bound2`, the output of "hello" is now limited to only `2`! *)
[@@by [%use bound2 x] @> auto];;

These VGs succeed because we have added the corresponding assumptions - if we omit the [%use ...] hints, ImandraX assumes nothing about random, so it cannot rule out return values above 0.4 (or below 0.0), and the stronger goals would not be proved.

Approximation functions

Most often, creating good approximations of opaque functions will be sufficient. For example, for code that deals with mathematical functions (e.g. Python’s math library), we have libraries of approximating functions available. For trigonometric functions, for example, we use bounded Taylor series expansions. So, for the sin function, this would be:

let sin (x : real) = 
    (* Truncated Taylor series approximation of `sin` for n=5 *)
    x -. 
    (x *. x *. x) /. 6.0 +.
    (x *. x *. x *. x *. x) /. 120.0 -.
    (x *. x *. x *. x *. x *. x *. x) /. 5040.0 +.
    (x *. x *. x *. x *. x *. x *. x *. x *. x) /. 362880.0 -.
    (x *. x *. x *. x *. x *. x *. x *. x *. x *. x *. x) /. 39916800.0

CodeLogician contains a number of approximations for the most popular functions (and this database is constantly growing!), and you can always use your own approximations to build out the model (there’s a command for that).
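As a hedged sketch of what “using your own approximation” might look like, here is a truncated Taylor expansion that could stand in for an opaque exponential function while executing a model (the name exp_approx and the degree of truncation are illustrative, not necessarily what CodeLogician would generate):

let exp_approx (x : real) =
    (* Truncated Taylor series approximation of `exp` around 0, with terms up to x^4 *)
    1.0 +.
    x +.
    (x *. x) /. 2.0 +.
    (x *. x *. x) /. 6.0 +.
    (x *. x *. x *. x) /. 24.0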