This is intended to be a living document and will be added to and corrected over time.
A certain amount of technical ability is required of a modern researcher. Whether you are running simulations, performing large calculations, transforming and analysing data, or automating mundane tasks such as renaming and moving data files, learning to program can really improve your workflow and productivity.
This page is a compilation of resources and learning materials that can help with learning the skills researchers need.
Like anything technical, programming is full of terminology and jargon which can make it hard to get started. Sarah from the Lyndhurst STEM club found this great article which explains a lot of terminology to help you get started.
Often the software you use to carry out your research will influence your language choice, but if you’re starting from zero, Python is a great choice.
We’ve compiled a list of resources for learning Python from scratch.
Integrated Development Environments (IDEs) are text editors designed for writing code. They generally feature syntax highlighting, code suggestions and debugging tools that let you pause your program and examine its state while hunting for errors.
Python has many data analysis libraries, but the ones you will use most are pandas, which offers data structures and operations for manipulating numerical tables and time series, and NumPy, a library optimised for numerical operations on large arrays and matrices. These two libraries form the foundation of many other scientific Python packages.
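As a quick illustration of what working with these two libraries looks like, here is a minimal sketch (the column names and values below are made-up example data, not from any real dataset):

```python
import numpy as np
import pandas as pd

# Build a small table of (made-up) measurements
df = pd.DataFrame({
    "sample": ["A", "B", "C", "D"],
    "temperature": [20.1, 21.4, 19.8, 22.3],
    "pressure": [101.2, 100.8, 101.5, 100.9],
})

# pandas: summary statistics and filtering rows by a condition
print(df.describe())
warm = df[df["temperature"] > 20.0]

# NumPy: fast element-wise arithmetic on whole arrays
temps_kelvin = df["temperature"].to_numpy() + 273.15
print(np.mean(temps_kelvin))
```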
For data visualisation, the seaborn library provides a high-level interface over the venerable matplotlib library and makes it easier to plot good-looking graphs. ggplot is another popular library, which follows the Grammar of Graphics approach. Bokeh is also worth a look: it is designed to generate interactive graphs that can be embedded in web pages.
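To give a feel for the high-level interface seaborn provides, here is a minimal sketch using one of seaborn's bundled example datasets (loading it requires an internet connection the first time):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Load one of seaborn's bundled example datasets
tips = sns.load_dataset("tips")

# One high-level call produces a scatter plot with points
# coloured by a categorical column
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="day")

plt.savefig("tips.png")  # or plt.show() for an interactive window
```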
There are Python IDEs with a data science focus, such as Rodeo and Spyder, which provide an integrated data table browser and visualisation window alongside standard IDE functionality.
The Jupyter Notebook is a platform that allows you to interactively write and run code, show its results and add text annotations, all within a single document. It is great for exploratory coding or for demonstrating a linear scientific workflow.
Here’s a demonstration of a notebook which is used to illustrate working through a machine learning exercise: https://nbviewer.jupyter.org/github/jdwittenauer/ipython-notebooks/blob/master/notebooks/ml/ML-Exercise1.ipynb
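To give a flavour of how a notebook reads, here is roughly what a single code cell might contain (the values are made up for the example); when the cell is run, its last expression is rendered directly beneath it, alongside any surrounding text annotations:

```python
# A single notebook cell: running it displays the summary table
# (the last expression) directly below the cell.
import pandas as pd

results = pd.DataFrame({
    "run": [1, 2, 3],
    "accuracy": [0.91, 0.93, 0.90],  # made-up example values
})
results.describe()
```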
R is a statistical programming language for data analysis and visualisation that is widely used in academia and industry. Many new statistical methods are first developed and made available in R, and it offers much more besides: packages such as Shiny, for example, make it relatively easy to develop interactive dashboards.
R also forms the basis of the Bioinformatics software suite Bioconductor.
There are a lot of online resources available for learning R, as well as a plethora of books (see links below) for self-directed learning on specific topics.
The most common Integrated Development Environment (IDE) for R is RStudio, produced by Posit. It incorporates project management, supports rendering and publishing with R Markdown and Quarto, and provides basic Git functionality for version control. Other IDEs also support writing R code.
There are many books on using R written with Bookdown; a selection is listed below.
On the RSE Sheffield blog: A concise guide to reproducible MATLAB projects
Online, self-paced MATLAB training is available to all at TUoS via the MATLAB Academy. There are courses on more specific topics such as image processing and deep learning in addition to the fundamentals of MATLAB.
Comprehensive guidance on developing your own MATLAB toolbox and how best to structure it: https://github.com/mathworks/toolboxdesign
TUoS researchers have access to a host of excellent training courses via LinkedIn Learning, covering an enormous range of technologies. A good example is the ‘code clinic’ series (no relation to RSE Code Clinics), in which a similar set of problems is tackled in different languages, making it a good way to transfer your knowledge of one language to another:
Beyond these, there are great courses on almost any topic in research computing. Why not try one of these?
Note: You’ll need to be on the university VPN to get access.
No matter what programming language you use, we always recommend using version control systems such as Git in your project.
With version control, all changes made to your code are tracked over time. This means you can change or delete code with confidence, knowing that you can always revert the changes if necessary. It becomes absolutely essential when working collaboratively on the same code. Version control also works well with any plain text file, e.g. LaTeX documents.
With Git, a folder/directory is converted into a repository that tracks changes to all of the files and subdirectories it contains. The repository can then be uploaded to and synchronised with online services such as GitHub, backing up your work and enabling you to access it from anywhere in the world. See our remote working with GitHub guide for more details.
Need help with a particular piece of your code? It’s easy to share access to your online repository with helpers, e.g. at a Code Clinic.
Download and installation instructions for all platforms can be found at:
Git is a command-line tool, so some familiarity with the command-line interface is required.
The services below offer free hosting for personal repositories and some even offer free pro versions for academia.
GUI tools can really help with routine Git operations and especially when trying to make sense of large repositories.
When running large simulations or analysing large datasets, your tasks can take hours or even days to complete; you may struggle to load everything into memory, or be unable to fit the dataset or results onto your hard disk. If this sounds like something you’re facing, it may be time to have a look at the High Performance Computing (HPC) clusters provided through the University of Sheffield.
An HPC cluster essentially consists of a number of computers connected by a fast network. Each machine, or node, normally has far more CPU cores and RAM than the average PC. The best way to take advantage of an HPC cluster is to split your computation into small, independent tasks and distribute them to run concurrently across multiple CPU cores or nodes, as sketched below.
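As a single-machine sketch of that "many small, independent tasks" pattern, the snippet below uses Python's standard multiprocessing module to spread work across local CPU cores. On a cluster you would typically submit each task, or a batch of tasks, through the job scheduler instead, but the principle of splitting work into independent pieces is the same (simulate_one here is a made-up stand-in for your own computation):

```python
from multiprocessing import Pool

def simulate_one(parameter):
    # Stand-in for one expensive, independent piece of work,
    # e.g. a single simulation run or one file's analysis
    return parameter ** 2

if __name__ == "__main__":
    parameters = range(100)
    # Distribute the independent tasks across the available CPU cores
    with Pool() as pool:
        results = pool.map(simulate_one, parameters)
    print(sum(results))
```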
Access to these clusters is open to all UoS researchers and academics, and they’re free at the point of use.
All of the HPC clusters mentioned above run the Linux operating system. You generally interact with them through the command-line interface (although on some you can launch a GUI application such as MATLAB).
As you have limited user permissions on these systems, you will mostly use pre-installed software (made available as modules); custom installation is normally done with package managers such as conda or Spack, which allow software and its dependencies to be installed into your home directory.
HPC clusters use a job scheduler to make sure everyone gets a chance to run their tasks. You write a job script that tells the cluster what tasks to run and what computing resources you need, then submit it to the scheduler to be added to the queue. The scheduler will run your job when it reaches the front of the queue and enough resources are available on the cluster.
For queries relating to collaborating with the RSE team on projects: rse@sheffield.ac.uk
Information and access to JADE II and Bede.
To be notified when we advertise talks and workshops, join our mailing list by subscribing to this Google Group.
Queries regarding free research computing support/guidance should be raised via our Code Clinic or directed to the University IT helpdesk.