Software
This is my software page. Most of it relates in some way
to my research interests
in Bayesian computation.
All of my public and maintained code is now available from either Github or R-Forge - please check these first.
Please note that much of the code listed below is old, buggy and
unmaintained! I continue to make it available, as some of it
still serves to illustrate how to go
about writing scientific software. However, if you are interested in
that topic, you might find following my blog to be
easier than trawling through my old code...
You should note that I have accounts on R-Forge, Sourceforge, Google code, and Github, and that many of my more substantial and better maintained software projects can now be found on those repositories.
I've decided to split stuff up by language
rather than by topic, which may seem a bit strange, but it makes more
sense to me. Files ending .tgz are gzipped tar files, and can be unpacked
on GNU/Linux systems with a command like tar xvfz foo.tgz. Note
that some of these should be unpacked into an empty directory -
do a tar tvfz foo.tgz first to check!
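The same check can be scripted. The following is a minimal Python sketch, not part of any package listed here: it lists the distinct top-level entries of a gzipped tar file, so that more than one entry suggests unpacking into an empty directory. The function names and the file name foo.tgz are illustrative assumptions.

```python
import tarfile


def top_level_entries(names):
    """Return the set of distinct top-level path components among names."""
    tops = set()
    for name in names:
        # Ignore empty components and a leading "./" prefix.
        parts = [p for p in name.split("/") if p not in ("", ".")]
        if parts:
            tops.add(parts[0])
    return tops


def archive_top_level(path):
    """Sorted top-level entries of a gzipped tar file (hypothetical helper)."""
    with tarfile.open(path, "r:gz") as tar:
        return sorted(top_level_entries(tar.getnames()))


# Example use (assumes foo.tgz exists in the current directory):
# print(archive_top_level("foo.tgz"))
```

If the result is a single directory name, the archive unpacks tidily into its own directory.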
C isn't a particularly beautiful language, but it is very powerful and
efficient, and its ubiquity makes it hard to ignore. When I write C, I
generally write pure ANSI C. Most of my code relies on
the GNU Scientific Library, which makes scientific computing
in C much more practical. Because of its speed, C is my preferred
language for developing MCMC-related code. I have a set of C links
and a set of links for the GSL.
- GDAGsim is a C library for carrying out
conditional simulation for large sparse Gaussian DAG models. It is useful
for doing Bayesian inference in large hierarchical linear models.
(Last updated: 21/8/2002)
- gillespie.tgz is a simple simulator
for stochastic kinetic models. It accepts models described
using a subset of SBML
Level 1. See the enclosed README.txt for further
details. Some example models are
packaged separately. Note that Carole Proctor has adapted this
code so that it accepts SBML Level 2 models - that code is
distributed separately as gillespie2. (Last updated: 21/7/2004)
- gsl-sprng.h is a
bit of code which wraps up the SPRNG 2.0 parallel random number
generator as a GSL RNG, so that one can use all of the nice GSL random
number distribution functions in MPI-based parallel stochastic
simulation codes which rely on SPRNG for independent random number
streams. You need to have GSL, MPI and SPRNG all up and running on a
cluster before attempting to use this code. (Last updated: 28/8/2002)
- meparse.tgz is a mathematical expression
parser. It builds a parse tree which may be evaluated repeatedly for
different variable settings. Variables are handled by a user-defined
callback function. It is not well-documented, but see the README and
example code to get the basic idea. It is used by my "gillespie"
simulator. Please note that this parser is buggy! I do not
recommend using it! (Last updated: 22/1/2004)
- mlmu2.tgz is a simple bit of C code for
fitting unbalanced two-level normal models using a 2-block MCMC
sampler. (Last updated: 26/6/2000)
- mlmu3.tgz is the corresponding code for a
three level nested hierarchical normal linear model. Again, a 2-block
MCMC sampler is used. This is the code referred to in "Conditional
simulation from highly structured Gaussian systems, with application
to blocking-MCMC for the Bayesian analysis of very large linear
models". (Last updated: 26/6/2000)
- pbc.tgz is the example
code from my chapter on Parallel Bayesian computation in the
Handbook of Parallel Computing and Statistics. It includes all the
code for the case study on stochastic volatility modelling. (Last
updated: 11/2/03)
- stochInf.tgz is a package for carrying
out MCMC-based Bayesian inference for stochastic kinetic
models using time-course data. See the project web
page for further details. This code is now very out
of date - see the CaliBayes project for more
recent work. (Last updated: 21/7/04)
- sv.tgz is some code for fitting univariate
discrete time stochastic volatility models and multivariate factor
stochastic volatility models in the "Shephard and Pitt" style. It's
essentially just a port of my original Sather classes. (Last
updated: 9/6/04)
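For readers unfamiliar with the kind of algorithm behind the gillespie simulator listed above, here is a minimal Python sketch of Gillespie's direct method for a stochastic kinetic model. It is an illustration only, not a port of the C code: it does not read SBML, and the function signature and the immigration-death example model are assumptions made for this sketch.

```python
import random


def gillespie(x0, stoich, hazards, t_max, rng=None):
    """Simulate one trajectory with Gillespie's direct method.

    x0      : list of initial species counts
    stoich  : list of state-change vectors, one per reaction
    hazards : function mapping a state to a list of reaction hazards
    t_max   : time at which to stop simulating
    Returns a list of (time, state) pairs, one per reaction event.
    """
    rng = rng or random.Random()
    t, x = 0.0, list(x0)
    path = [(t, tuple(x))]
    while True:
        h = hazards(x)
        h0 = sum(h)
        if h0 <= 0.0:
            break  # no reaction can fire, so the process has stopped
        t += rng.expovariate(h0)  # exponential waiting time to next event
        if t > t_max:
            break
        # Pick reaction j with probability h[j] / h0.
        u = rng.random() * h0
        j, acc = 0, h[0]
        while acc <= u and j < len(h) - 1:
            j += 1
            acc += h[j]
        x = [xi + d for xi, d in zip(x, stoich[j])]
        path.append((t, tuple(x)))
    return path


# Hypothetical immigration-death model:
#   reaction 0: 0 -> X at constant rate 1.0
#   reaction 1: X -> 0 at rate 0.1 * X
stoich = [[1], [-1]]
hazards = lambda x: [1.0, 0.1 * x[0]]
path = gillespie([0], stoich, hazards, t_max=50.0, rng=random.Random(1))
```

Each entry of path records the time of a reaction event and the state immediately after it.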
Java is now a reasonably good object-oriented programming language,
with a huge standard library which greatly facilitates application
development. It also has some reasonable scientific libraries, most
notably COLT, parallel COLT and Apache Commons Math.
- CaliBayes was a Java-based project that I helped to develop. It is a
web-services-based system for calibrating stochastic and deterministic
systems biology models encoded in SBML using time course experimental
data.
LISP-STAT is/was a wonderful environment for object-oriented interactive
statistical computing, but its reliance on LISP means that it will
only ever have a cult following. In fact, it seems to be slowly dying
a death, due partly to the success of 'R'. I wouldn't now recommend
LISP-STAT to people who aren't already familiar with it - 'R' is
probably a safer bet, despite its faults. I have some LISP-STAT
links which are a useful starting point.
- BAYES-LIN
is my main LISP-STAT project. It provides a prototype system for
carrying out Bayes linear local computation. (Last updated: 7/4/2000)
Perl is a revolting but exceedingly useful language which is great for
text processing and related activities. I have a few perl links.
- tab2bugs takes a tabular MCMC output file
and turns it into BUGS-style output and index files ready for reading
into CODA. It requires an output file (such as produced by
mlmu2) with name of the form myoutput.tab which is
converted using tab2bugs myoutput. It makes multiple passes
through the file, avoiding the need for large amounts of memory. (Last
updated: 13/2/2002)
- thin is useful if you have a whopping great
output file which you can't read into CODA. Thin your tabular output
with this before converting. (Last updated: 4/4/2000)
- burn is useful for manually chopping some burn-in
iterations out of a tabular MCMC file. For example, you may want to do
something like myprog 100000 | burn 10000 | thin 10 > myout.tab.
(Last updated: 29/5/01)
- chop is for trimming the last line off the end of
a file - useful for inspecting the results of a running MCMC
code, as the last line of the output file is usually
incomplete in this case. For example, chop < myout.tab > snapshot.tab. (Last updated: 19/5/04)
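The operations performed by burn, thin and chop are simple enough to sketch in a few lines. The Python below is an illustration of the idea, not a transcription of the Perl scripts. It assumes one MCMC iteration per line and no header line, and that thinning keeps the first line of each block of k; the real scripts may differ on both points.

```python
from itertools import islice


def burn(lines, n):
    """Drop the first n lines (burn-in iterations)."""
    return islice(lines, n, None)


def thin(lines, k):
    """Keep every k-th line, starting with the first."""
    return islice(lines, 0, None, k)


def chop(lines):
    """Yield every line except the last (possibly incomplete) one."""
    prev = None
    first = True
    for line in lines:
        if not first:
            yield prev
        prev, first = line, False


# Rough equivalent of: myprog | burn 10 | thin 10
lines = [str(i) for i in range(100)]
kept = list(thin(burn(lines, 10), 10))
```

Because each function consumes its input lazily, a pipeline of them makes a single pass over the data and never holds the whole file in memory.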
Python is a wonderful object-oriented scripting language. I now use it
instead of perl for doing things such as text processing,
web/internet programming/CGI scripts, XML processing, database
interfacing, GUI development, etc. It has a
simple syntax (that many people refer to as "executable
pseudo-code"), and is in fact a great language to learn to
program with (especially object-oriented programming concepts). It
isn't as fast as Java, but it is much quicker and easier to develop
code in Python than it is in Java. Everyone should know Python! I have
some python links.
- SBML-shorthand and mod2sbml.py - a language for
describing systems biology models. (Last updated: 1/7/05)
- isbn.py is a python module for messing about
with ISBN-10 and ISBN-13 numbers, including validation, computation of
check-digits, and conversion from one form to the other. Start
python, then do import isbn followed by help(isbn) for full
documentation including examples. (Last updated: 11/8/07)
- mcmc.py is a short script for massaging
tabular MCMC output files. Supersedes the above perl scripts
thin, burn and chop. Do mcmc.py -h for usage info.
(Last updated: 26/1/05)
- validateSBML.py is a short script
which calls the BASIS SBML validation Web Service. Useful
where you have python, but not libSBML and the libSBML python
wrappers. Requires the python SOAP library. (Last updated: 22/1/05)
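To make the ISBN operations mentioned above concrete, here is a minimal Python sketch of the standard check-digit rules: ISBN-10 uses weights 10 down to 1 modulo 11 (with X standing for 10), and ISBN-13 uses alternating weights 1 and 3 modulo 10. The function names are assumptions made for this sketch and are not the interface of isbn.py; inputs are assumed to be plain digit strings without hyphens.

```python
def isbn10_check_digit(first9):
    """Check character for the first nine digits of an ISBN-10."""
    total = sum((10 - i) * int(d) for i, d in enumerate(first9))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def isbn13_check_digit(first12):
    """Check digit for the first twelve digits of an ISBN-13."""
    total = sum((3 if i % 2 else 1) * int(d) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def is_valid_isbn10(isbn):
    """True if isbn is ten characters long with a correct check character."""
    return (len(isbn) == 10 and isbn[:9].isdigit()
            and isbn10_check_digit(isbn[:9]) == isbn[9].upper())


def is_valid_isbn13(isbn):
    """True if isbn is thirteen digits long with a correct check digit."""
    return (len(isbn) == 13 and isbn.isdigit()
            and isbn13_check_digit(isbn[:12]) == isbn[12])


def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to ISBN-13 by prefixing 978 and recomputing the check digit."""
    body = "978" + isbn10[:9]
    return body + isbn13_check_digit(body)
```

Conversion only reuses the first nine digits: the old check character is discarded and a new one is computed over the 978-prefixed number.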
R is a very good general purpose environment for statistical
computing, with excellent graphical output and a huge range of
libraries and packages. It is technically inferior to LISP-STAT in
many ways, but its simple intuitive syntax means that it has a great following. R has now
become the de facto standard environment for serious interactive
statistical computing. I have some links for S, S-PLUS and R.
- Discrete
stochastic models test suite - a suite of
SBML models and correct output, used for testing the correct
behaviour of discrete stochastic simulators. The associated
test code is written in R. (Last updated: 12/7/2005)
- rdiric.r is a simple function for
simulating Dirichlet random quantities in a reasonably efficient
way. For example, rdiric(10,c(1,2,3)) simulates a 10x3 matrix whose
rows are independent Dirichlet random quantities with parameter vector
(1,2,3). (Last updated: 10/12/2000)
- rmn.r is a simple function for simulating
multinomial random quantities in a reasonably efficient way. For example,
rmn(100,1000,c(0.5,0.3,0.2)) simulates a 100x3 matrix whose
rows are independent multinomial random quantities based on 1000
trials with proportions (0.5,0.3,0.2). (Last updated: 6/6/2001)
- itosim.r is a function for straightforward
simulation of univariate diffusion processes. Examples of use are
included. (Last
updated: 10/12/2000)
- gammaprior.r is a function for prior belief
elicitation of variance components. Given a prior probability interval for
the standard deviation, it attempts to find a gamma prior on the precision
which is consistent. For example, for a 95% probability interval of [10,20] on the
standard deviation scale, gammaprior(10,20,0.95) will find an
appropriate gamma prior for the precision. (Last updated: 9/3/2001)
- plottab.r is a function for graphical display
of the contents of a tabular MCMC output file. Useful for having a "quick
peek" at some MCMC output before loading it into CODA for a more detailed
analysis. Requires the "ts" library to be loaded. (Last updated:
1/8/2003)
- fmc.r is a set of functions for messing around
with (discrete time) finite state Markov chains. An example of
use is included. (Last updated: 18/9/2002)
- These days, any non-trivial R project that I'm involved with I
develop on R-Forge as an R package. See my
R-Forge page for a list of projects that I am involved with.
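As an illustration of what a function like rdiric computes, here is a minimal Python sketch of Dirichlet simulation using the standard construction of normalising independent gamma variates. Whether rdiric.r uses this construction is not stated above, so treat this as one common approach rather than a description of that code; the Python function merely borrows the name and argument order for comparison.

```python
import random


def rdiric(n, alpha, rng=None):
    """Simulate n independent Dirichlet(alpha) vectors.

    Uses the fact that if g_i ~ Gamma(alpha_i, 1) independently, then
    g / sum(g) has a Dirichlet(alpha) distribution.
    Returns a list of n rows, each a list of len(alpha) proportions.
    """
    rng = rng or random.Random()
    rows = []
    for _ in range(n):
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        s = sum(g)
        rows.append([gi / s for gi in g])
    return rows


# Analogue of rdiric(10, c(1,2,3)): ten rows of three proportions each.
sample = rdiric(10, [1, 2, 3], rng=random.Random(42))
```

Each row of sample is non-negative and sums to one (up to floating-point rounding).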
Sather was (in principle) an excellent programming language, but it is now obsolete.
It is a safe, efficient object-oriented language, well-suited for scientific
computing, but only has a cult following, partly due to the fact that the
only available compilers suck. In fact, sather too now seems to be
dying a death - why is it that the best languages always get
killed off by inferior competition?! My page of sather
links provides plenty of information for the new user.
- kalman.tgz is a set of sather classes for
Kalman filtering, smoothing and simulation smoothing of dynamic linear
state-space models. It is well documented, and has a nice interface,
but is not implemented in a very efficient way. See the enclosed
README. Note that this package also contains a class RND2, which
provides a set of routines for multivariate random number generation,
which may be useful independently of the main classes. (Last updated:
27/5/2001)
- fsv.tgz contains classes for Bayesian MCMC
analysis of univariate and multivariate factor stochastic volatility
models in the "Shephard and Pitt" style. See the enclosed README for
further info. (Last updated: 27/5/2001)
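For a flavour of the Kalman filtering recursions that kalman.tgz implements for general dynamic linear models, here is a minimal Python sketch for the simplest special case, the univariate local-level model. It is an illustration under stated assumptions, not a translation of the Sather classes: the function name, argument names and example data are all made up for this sketch, and smoothing and simulation smoothing are not covered.

```python
def local_level_filter(ys, m0, c0, v, w):
    """Kalman filter for the local-level model

        y_t = x_t + N(0, v),    x_t = x_{t-1} + N(0, w),

    with prior x_0 ~ N(m0, c0). Returns the filtered (mean, variance)
    of x_t given y_1..y_t, for each t.
    """
    m, c = m0, c0
    out = []
    for y in ys:
        r = c + w               # prior variance of x_t before seeing y_t
        q = r + v               # one-step forecast variance of y_t
        gain = r / q            # Kalman gain
        m = m + gain * (y - m)  # posterior mean of x_t
        c = r - gain * r        # posterior variance of x_t
        out.append((m, c))
    return out


# Hypothetical data, purely for illustration.
filtered = local_level_filter([1.2, 0.8, 1.1, 0.9], m0=0.0, c0=10.0, v=1.0, w=0.1)
```

With a vague prior (large c0) the first filtered mean lies close to the first observation, and the filtered variance settles quickly as more data arrive.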
SBML (Systems Biology Markup Language) is not a conventional
programming language, but an XML-based markup language for
describing the biochemical network models that arise in Systems
Biology. It is the closest thing to a standard in that area. I
have some links for SBML and XML.
- SBML-shorthand - a language for
describing systems biology models. (Last updated: 1/7/05)
- Discrete
stochastic models test suite - a suite of
SBML models and correct output, used for testing the correct
behaviour of discrete stochastic simulators. (Last updated:
12/7/2005)