We explain what mindsets and tools are important for you to master in order to be productive in this lab, and we describe why certain tools are useful. However, we only list the idioms you need to learn – you will have to learn these idioms on your own (the suggested reading materials will help you get started with this).
To begin, here are some principles and things to keep in mind as you are creating software:
Watch the Missing Semester videos – they cover many topics about the "craft" and social habits of a software engineer, like what is a good way to test code, how to think about the command line, what is a good text editor, how to use version management software, etc. In particular, you are strongly encouraged to watch the following lectures:
You can swing pliers to drive in a nail and you can twist a knife to unscrew a screw, but please never do that. Using tools improperly invariably makes you worse at your craft. It does not show in one small task, but the poor quality compounds fast, resulting in slower progress and a lower-quality final result. The most important lesson to take from this document is that you have to learn to notice on your own when you are using a tool poorly. Always ask yourself whether there is an easier way to do something. If a tool feels like a poor fit, ask yourself why someone would make such a seemingly stupid tool, and what idiomatic use its creator had in mind.
Below we have a list of important tools. Underlined are all the idioms that you have to learn in order for the tool to be your superpower (instead of it being a drain on your productivity). Read the official manual for the tool and then search for more resources on your own until you know the meaning of these idioms. Suggested readings are provided.
Hackers use the command line not to look cool or intimidating but because it is actually the easier, lazier, and faster tool when it comes to moving files around, searching for things, automating menial tasks, and compiling/installing/configuring software. Learn how to use the following baseline file manipulation tools:
to walk around:
ls, cd, pwd, mv, cp, mkdir, rm
to read or search in files and folders:
cat, less, grep, find
recursive flags: a lot of commands use the -r flag to set "recursive" mode, so that the command is performed over an entire folder and all of its sub-folders; e.g. grep -r searches in all files in all subfolders.
glob patterns: if you want to do something to many similarly-named files, it helps to use "glob" patterns. E.g. typing rm IMG*.png deletes all files that start with IMG and end with .png. Also, never put whitespace in your file names; use dashes or underscores instead, to make globs more reliable.
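As a concrete sketch of recursive flags and globs in action (the file names and TODO note are made up):

```shell
# Work in a scratch directory so nothing real gets deleted
cd "$(mktemp -d)"
touch IMG001.png IMG002.png
echo "TODO: fix the colormap" > notes.txt

grep -r TODO .   # recursively searches every file under the current folder
ls IMG*.png      # the shell expands the glob to: IMG001.png IMG002.png
rm IMG*.png      # removes exactly those two files
ls               # only notes.txt is left
```

Note that the shell itself expands the glob into the matching file names before rm ever runs.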
The program that interprets and executes your commands in the terminal is called a terminal shell. It is a full-fledged programming language, albeit a clumsy one, meant for interactive work, not for large scripts. By default it is probably the bash shell, but more modern shells are drastically easier to use. You are encouraged to install the fish shell, which offers features like auto-completion of the commands you are typing, in-line help, detailed syntax highlighting, etc.
Do get in the habit of using the tab button to autocomplete your commands. Also get in the habit of using the command history, e.g. by pressing the up arrow to see previous commands you have executed.
Learn how to make simple loops in your shell so you can write things like
for file in experimentaldata*.txt
    python my_science_script.py $file
end
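For comparison, in the default bash shell the same loop (with the same made-up script name, here replaced by echo so the sketch runs on its own) would be written as:

```shell
# Scratch directory with a couple of stand-in data files
cd "$(mktemp -d)"
touch experimentaldata1.txt experimentaldata2.txt

# bash uses do/done where fish uses a bare body and `end`
for file in experimentaldata*.txt
do
    echo "processing $file"   # stand-in for: python my_science_script.py "$file"
done
```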
Now on to shells; on Windows they would be:
cmd: the built-in terminal emulator; never use it, it will bring you only pain; you cannot even copy-paste text reliably with it; it is a bad tool even by the standards of the 1980s.
PowerShell: if you are going to use the terminal just to launch one or two commands, PowerShell might be a good idea. If you are going to do a ton of programming and computational work, you will probably need to spend a couple of days learning how to set up WSL (the Windows Subsystem for Linux).
Your Unix OS comes with a pretty thick manual. You can use man some_command to read the built-in documentation for some_command, and press the / key to enter search mode while reading. Use the man pages when you have a question at least as often as you use internet search.
git and other distributed version control systems (mercurial and others) let you track the history of your project. Think of them as undo-redo functionality for the entire file tree of text files that makes up your project.
A software developer uses git to keep track of various versions of their code and to collaboratively make changes while working in a team. A scientist also uses git to avoid situations in which a small change to their code leads to different scientific predictions with no way to backtrack.
GitHub, GitLab, and Bitbucket are three popular hosting providers for git (also called "forges") that also offer convenient issue tracking, bug reporting, social collaboration features, etc.
Terms and commands to know about git for personal use: commit, checkout, branch, pull, git-log, and git-status (use the last two all the time in between other commands to double-check the state of your repository).
Terms and commands to know about git for collaborative use: remotes (origin and upstream), pull request, merge, rebase, cherry-pick.
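A minimal personal-use session, with a made-up repository name, file name, and identity, might look like:

```shell
# Start a throwaway repository
cd "$(mktemp -d)"
git init my-analysis && cd my-analysis

echo "print('hello')" > analysis.py
git add analysis.py
# -c sets an identity just for this one command, in case none is configured
git -c user.name="Jane Doe" -c user.email="jane@example.com" \
    commit -m "Add first analysis script"

git status            # double-check the state of the repository
git --no-pager log    # list the recorded snapshots
```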
Useful first-steps resources:
Avoid installing software that would modify your system in a way that is difficult to reverse:
A package manager is a tool that installs software for you, together with all necessary dependencies, in a controlled way, in a managed environment, permitting complete removal, and guaranteeing that the rest of your operating system will not be harmed. It is a better way to deal with software installs than manually downloading random installers off the internet; however, it might not always have the latest version of a piece of software.
Programming languages are some of the few things you might want to install independently of a package manager, as you might need the very latest version. Programming languages also frequently have their own internal package managers for various libraries (e.g. pip for Python, conda for R and Python, and Julia's built-in Pkg).
Never, ever, use sudo to install a package outside of your operating system's package manager. For instance, sudo pip … and sudo make install are guaranteed to ruin your day and cause severe complications down the line. If you are compiling something from scratch, install it in a local user directory, not in a system directory.
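If you do compile something from source, a typical local-directory install looks like this sketch (./configure stands in for whatever build step the package actually uses):

```shell
# Build and install into ~/.local instead of system directories – no sudo needed
./configure --prefix="$HOME/.local"
make
make install

# Make sure locally installed binaries are found
export PATH="$HOME/.local/bin:$PATH"
```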
You will probably deal with a lot of "embarrassingly parallel" problems. Learn how to use your university’s computing cluster in order to submit thousands of jobs with varying input parameters. You will probably be using the SLURM batch scheduler.
sbatch, used to launch new jobs. sinfo can tell you which partitions are a good choice to run on. Always set the --cpus-per-task, --job-name, --mem-per-cpu, --partition, and --time flags. Unless you are doing something fancy, always set --nodes=1. Consider using --array. Use the --error flag with the %a placeholder to store log files (and look at the log files, maybe with tail -f). Use squeue -u your_username to check the status of your jobs and scancel, which is self-explanatory, to cancel them.
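Putting these flags together, a submission script might look like this sketch (the partition, resource numbers, and script name are hypothetical and will differ on your cluster):

```shell
#!/bin/bash
#SBATCH --job-name=param_scan
#SBATCH --partition=general        # pick a real one with sinfo
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=01:00:00
#SBATCH --array=0-999              # one task per input parameter
#SBATCH --error=logs/job_%a.err    # %a is replaced by the array index

python my_science_script.py --parameter-index "$SLURM_ARRAY_TASK_ID"
```

You would submit it with sbatch scan.sh and then watch it with squeue -u your_username.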
Virtual Machines, and more recently their simplified cousins, the Containers, are used to create well isolated and easily reproducible software environments. They are like a second separate “virtual” computer inside of your computer. Scientists use them mainly for their reproducibility: frequently a competent scientist who is not a particularly competent software engineer would create a very valuable software package that is incredibly difficult to install on another machine; instead of making the software easy to install, they might create a virtual machine that users download and run directly, having everything necessary already preinstalled. It is a relatively bad solution because it does not let you compose multiple packages together, but it is still a useful-to-have option if you are working with particularly complex / difficult to compile libraries. On the other hand, if you are a sysadmin it is a fantastic tool to have in order to create well-separated projects that all run on the same server, without worrying about them impeding each other.
A more reliable way to ensure reproducibility of your work, but also one that requires more discipline, is to use one of the build systems at your disposal. Makefiles are the prototypical example: a way to specify recipes for how something is done, i.e. a list of commands that depend on each other. They are used for compiling C code, compiling a LaTeX article, or even re-running experimental data analysis. Consider using them as a self-documenting way to describe your project. Using such a tool will force you to be more organized about your own work and will make it easier for others to reproduce it.
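For the data-analysis case, a minimal Makefile might look like this sketch (the file and script names are made up; note that recipe lines must be indented with a tab):

```make
# Rebuild figure.png only when results.csv changes,
# and results.csv only when the script or the raw data changes.
figure.png: results.csv plot.py
	python plot.py results.csv figure.png

results.csv: analyze.py raw_data.csv
	python analyze.py raw_data.csv results.csv
```

Running make then re-executes only the steps whose inputs actually changed.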
Separately, you might need to list the exact state of the environment in which you have been working: library versions, operating system, etc. While containers are incredibly convenient for this task, especially when your language of choice does not have good facilities for it, it is important to be more disciplined and provide detailed environment descriptions that can be reproduced without a virtual machine. In Python, pip freeze is a workable solution, while in Julia the Manifest.toml file is a superb way to document for reproducibility.
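A minimal sketch of the pip freeze workflow (the pinned package line in the comment is just an example of what the output looks like):

```shell
cd "$(mktemp -d)"

# Record the exact versions of every installed library,
# one pinned line per package, e.g. numpy==1.26.4
python3 -m pip freeze > requirements.txt

# A collaborator recreates the environment later with:
# python3 -m pip install -r requirements.txt
```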
Idioms that you need to learn to be efficient in the use of a language are underlined below. Always read the official documentation and manual of a language – it might not be the first thing you read, but always read it before you start using the language intensely. Similarly, always skim the documentation for the entire standard library of a language at least once. Always skim the official documentation of a library you are going to use. Depending on your learning style, another resource might be best suited for first steps into a language or a library, but at some point you have to Read The Fine Manual (RTFM). Understanding the manual, and knowing that you understand it, is the litmus test you need to pass before using a tool; otherwise you will slow down your progress and frustrate your collaborators with your un-idiomatic use of the language. Make sure you know the difference between a script/notebook and a library/module/package, and know how to create either in your language of choice. Some languages have a good REPL that can be a very useful tool (e.g. Julia, IPython, R, especially when built into an IDE).
Your code will not work the first time you write it. You might think it does, but if you have not created a test suite, your code is almost certain to have serious bugs. Never trust yourself when writing code. Always test your code.
First, test it just by running it in simple situations in which you know what the results should be. Do "consistency checks", e.g., if you have a brand new summing algorithm, verify that it gives a positive answer if all the input terms being summed are positive. Feel free to generate random inputs. Coming up with a consistency check is itself an extremely useful exercise, deepening your understanding of the task at hand.
If you are creating a substantial piece of code, a library even, learn how to do unit testing, write test suites and doctests, and establish continuous integration.
When a test fails or a bug strikes, there will be a stack trace (a long error message listing which functions were being evaluated). Learn how to read stack traces. Use an IDE that lets you jump, with a single click, straight to editing the code referred to by the stack trace.
Learn how to use a performance and/or memory profiler for your code in order to know why it is slow. A debugger would be useful too so that you can pause and step through your programs.
If you are creating a larger project, whether public or internal, documentation is a must. Write doctests as well, and include documentation compilation and verification in your continuous integration pipeline. Read documentation.divio.com or diataxis.fr.
The workhorse of most of your code will probably be large structured sets of floating point data. If you are going to do a lot of linear algebra, it would be natural to use array/matrix/tensor objects, but if you are going to work with statistics and data science, you have to learn about data frames and statistical graphing tools like the "grammar of graphics". Data frames are a particular way to process arrays for statistical purposes that is much easier to use than bare arrays. Learn how to classify them as long vs wide data frames. Idiomatic operations on data frames are grouping, pivoting, melting, and stacking. Data frames are built into R, implemented in the DataFrames.jl library for Julia, and in the pandas library for Python.
Julia is a recent dynamic language with a peculiar compilation model that lets it have the rich, expressive, dynamic style of Python while being as fast as C. If you are going to do high-performance scientific computing, Julia is unsurpassed. However, the first import and the first execution of a function can be exceedingly slow, because the code has to be compiled first.
One of the main programming paradigms of Julia is multimethods (discussed also at JuliaCon 2019).
Read the manual, understand the performance tips, workflow tips, style guide, FAQ, and differences to other languages. Ask a lot of questions on the Julia forum. The Julia Academy seems to have good lectures and the MIT Computational Thinking class is a good programming class using Julia.
Terms and commands to know about administering Julia: install with juliaup, manage separate projects with the built-in Pkg manager, and always work in a per-project environment, not in a global environment, e.g. by starting the interpreter as julia --project=the_project_folder. Learn what the Manifest.toml files in your environment are.
Libraries and their idiomatic use:
Array is sufficient for array work – understand the dot notation for broadcasting.
BenchmarkTools.jl – constantly use the @benchmark macro to check your code; always aim to have zero allocations in your code – they are the largest source of easy-to-optimize slowdown.
Base.@inbounds macros are baseline optimizations for your inner loops; LoopVectorization.@turbo is a more advanced version. Base.Threads.@threads is an easy way to run multithreaded computations.
Makie.jl – plots; notoriously slow to load due to compilation, extremely fast afterwards.
Gadfly.jl – statistical plots.
Revise.jl – extremely helpful when editing code and wanting to avoid slow recompilations on restarting Julia.
QuantumOptics.jl – for Schroedinger and Lindblad equations.
QuantumClifford.jl – for Clifford circuits.
Use @benchmark all the time; there are more advanced profilers and static analysers as well.
Python is a general-purpose object-oriented language. It became popular for science work, first through the scipy libraries. If used correctly for linear algebra, it can have much of the speed of difficult-to-debug Fortran code, in a much easier-to-work-with interactive interface. It also has the benefit of an enormous ecosystem of useful non-scientific libraries, e.g. for web dev.
On the sysadmin side of things, never, ever, not in a million years, use your preinstalled version of Python. Only pain and suffering await those who walk such a path. Your operating system probably depends on it for its own general-purpose tasks, and weird things will start happening if you modify it. Never use sudo with Python. Never use Python 2, only Python 3, unless there is an adult in the room (your PI is not an adult). Read and follow the PEP 8 style guide.
Terms and commands to know about administering Python: either virtualenv environments or conda environments.
Libraries and their idiomatic use:
numpy – for array/matrix/tensor calculations; idioms: vectorized operations, broadcasting, allocating vs in-place operations, views, the difference between an array and a Python list.
pandas – for data frames.
scipy – for an assortment of optimization, integration, equation solving, signal processing, and more.
matplotlib – for engineering plots.
seaborn – for statistical plots.
scikit-learn – for old school machine learning (not neural nets).
tensorflow – for autodiff array/matrix/tensor calculations; idioms: automatic differentiation, gradient descent, optimizer.
C modules – for high-performance Python.
%timeit – cell magics in Jupyter.
R is a powerful, open-source tool for data analysis that is built around data frames, and it will become very useful as you move on to analyzing your data and creating your figures.
Particularly when you are new to R and/or coding, you may be tempted to use Excel for some of this work. Unless you have a very small dataset that you are trying to visualize and will never use it again, using R will save you time in the long run.
When writing in R, it is common to use some sort of notebook. These notebooks allow you to embed your code, code output, and commentary in one place, which will aid in both your exploratory work and sharing your R code with others. In addition to R notebooks, you can also use Markdown directly through RStudio. RStudio is an IDE that allows you to simultaneously view your code, variables, figures, and more.
Terms and commands to know about R: data frames, long, wide.
Useful libraries: Tidyverse, ggplot2, readr
Resources: learning the tidyverse style, R commands cheatsheets
Do not use these tools unless this whole document felt trivial to you. Do not use them unless you can clearly describe why they are better than Julia or Cython/Numba/Jax. If you are going to use a tool like C/C++/Fortran/Rust/Zig, make sure you know what the preprocessor, compiler, and linker do. Make sure you know what a pointer is and how arrays work.
Document prepared in collaboration with Prof. Bridget Hegarty.