I need help making good scientific software
Part of my job as a PhD student is to write data analysis software for a future space mission. Because the launch is about ten years away, there is still a lot of R&D going on, so everything I am writing now is really a prototype. It may or may not be used in the final pipeline.
Now, I really want my contribution to this effort to survive the next decade. I want it to be there, even if only in essence, in the pipeline that will run the first analysis of the mission data. But whether my code makes the cut depends on many things, some of which fall under the umbrella of “software quality”.
Software of good quality, in this context, is something that will happily run on other machines, years or decades from now, with no hassle. That won’t crash, won’t require a special environment to be set up for it, won’t yell at the user about dependency version incompatibilities. That will scale up and down with the size of the problem at hand and the computing resources available. Something that will just do its job efficiently, silently, and above all correctly.
In practice that’s just an ideal. We physicists are not very good at making robust, interoperable, pleasant-to-use software. We don’t have much training in software engineering; we mostly learn to write for loops that compute big formulas. But I want to try.
The plea for help
Now, given that time is finite and my thesis won’t write itself, I need practical advice on doing this (if you have some experience and are interested, please give me a hand!). The specific constraints are the following:
- I am pushing for the use of Julia in an environment dominated by Python and C, but I am not actually experienced in Julia. I just think it is a wiser engineering choice.
- I am writing signal processing and Monte Carlo routines.
- This has to run on standard Linux x86_64 clusters (Slurm, Kubernetes).
- My routines should be easy to call from Python and C for interoperability.
- We might want to throw GPUs at the problem.
What I thought so far
From my limited experience I know that writing unit tests by hand takes time and only catches the bugs you anticipated anyway; inline snapshot testing at least helps by making them easier to write. Random test generation seems promising, and I will try out Supposition.jl very soon. Formal methods like Alloy and TLA+ look fun, but I have way too little time to learn them, and besides they seem most useful for concurrent algorithms, which is not what I am writing.
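To make this concrete, here is the kind of property I have in mind for my signal processing routines: a minimal sketch using Supposition’s `@check` macro and `Data` generators as I understand them from the documentation (the round-trip property and the tolerance are my own invention, so treat the details as an assumption rather than a recipe):

```julia
using Supposition, FFTW

# Generators: finite Float64 vectors of moderate length. The generator names
# and keywords follow the Supposition.jl docs; check them against the
# released version before copying this.
floats  = Data.Floats{Float64}(; nans=false, infs=false)
signals = Data.Vectors(floats; min_size=1, max_size=1024)

# Property: an FFT round trip recovers the input up to floating-point error,
# for *any* finite input, not just the handful I would have typed by hand.
@check function fft_roundtrip(x = signals)
    isapprox(real.(ifft(fft(x))), x; atol=1e-9 * max(1.0, maximum(abs, x)))
end
```

The selling point over plain random testing is that when a property fails, the framework shrinks the input down to a minimal counterexample, which is usually far easier to debug than whatever giant random vector first triggered the bug.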
On the robustness and ease of installation side of things, I want to avoid dependencies like the plague, unless they are very stable.
Fortunately, for now I depend only on the C library FFTW and its Julia bindings (other than Julia itself). Those look stable enough that I shouldn’t worry about using them, though I am wondering if I should vendor them to make things easier for the users and myself. Other Julia packages I use (Revise, PyPlot) are development-only, so I don’t count them. Supposition.jl is an interesting case: if I do end up using it, I will have test suites that depend on it. That is still just a development dependency, but unlike Revise and PyPlot, its usage goes into the source tree. The standard answer seems to be Pkg’s test target, sketched below.
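As far as I can tell, dependencies listed under `[extras]` and named in the `test` target of `Project.toml` are installed by `Pkg.test()` but never by someone who merely installs the package. Roughly like this (the package name is hypothetical and Supposition’s UUID is elided; both come from the registry):

```toml
# Project.toml (sketch only)
name = "MyAnalysis"   # hypothetical

[deps]
FFTW = "7a1cc6ca-52ef-59f5-83cd-3a7055c09341"

[extras]
Supposition = "..."   # real UUID comes from the General registry
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test", "Supposition"]
```

Recent Julia versions also accept a separate test/Project.toml, which isolates the test environment even further.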
For interoperability with C, I know that compiling Julia code into C-callable libraries is possible, but I have never done it. The other direction should be easy.
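To spell out both directions, a tiny sketch. `@ccall` is built into the language; the exported function below is hypothetical, and my understanding is that PackageCompiler.jl’s `create_library` is what turns a package full of such `@ccallable` functions into a shared library for the C side to load:

```julia
# Julia -> C: foreign calls are built in and need no glue code.
# This example (straight from the Julia manual) calls strlen from libc.
len = @ccall strlen("space mission"::Cstring)::Csize_t

# C -> Julia: Base.@ccallable gives a function a C-ABI entry point, which
# PackageCompiler.jl can then export from a compiled shared library.
# Hypothetical routine: mean power of a buffer handed over from C.
Base.@ccallable function mean_power(x::Ptr{Cdouble}, n::Csize_t)::Cdouble
    s = 0.0
    for i in 1:Int(n)
        s += unsafe_load(x, i)^2   # 1-based access into the C buffer
    end
    return s / n
end
```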
For interoperability with Python, there is PythonCall, which I hope won’t break too much. But if we are ambitious, in the long term this is not even needed: all the current Python code could be replaced by Julia. Not within the timeframe of my PhD, though.
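For reference, the Julia-calling-Python direction looks like this with PythonCall (NumPy stands in for whatever existing code we would wrap):

```julia
using PythonCall

# Import a Python module and use it from Julia; PythonCall converts values
# at the boundary, and pyconvert brings results back to native Julia types.
np = pyimport("numpy")
x = np.linspace(0, 2π, 1000)
peak = pyconvert(Float64, np.max(np.abs(np.sin(x))))
```

The opposite direction, colleagues calling my Julia routines from their Python scripts, goes through juliacall, the Python half of the same project.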
Where that leaves us
This week I will try out this fancy property-driven development thing. It looks like it can be very effective, and fun. Hopefully it won’t distract me too much from the actual thesis goals.
Anyway, cheers!
tags: software