Command line basics

Basic proficiency with the Unix shell is essential for anyone who wants to start doing computational work outside of their laptop and Excel. Unix shells provide an interface for interacting with Unix-like (Mac OS, Linux, etc.) operating systems and a scripting language for controlling the system. The Unix philosophy is a set of software engineering norms and concepts that guide how the tools of the Unix shell interact with one another. Learning a few of these command line tools, and how they can be strung together into what are called “pipes”, is a powerful skill for developing quick and composable bioinformatics programs. Here, we’ll describe some essential commands to get you started using the command line.

This tutorial is very brief. If you want a more in-depth tutorial, be sure to check out the Software Carpentry lesson on the Unix shell.

Accessing the terminal

First, you’ll have to open the Terminal application. If you’re on macOS, the quickest way to access your terminal is to press “command + space”, type “terminal”, and press Enter. On Windows, you’ll have to install Windows Subsystem for Linux, which will allow you to interact with a Linux distribution (Ubuntu by default). These default terminals are not the only options for interacting with your computer on the command line. On macOS, iTerm2 is a popular terminal emulator, and Alacritty is a cross-platform alternative.

Once you’ve opened the terminal app, you’re ready to start typing commands at the command line.

Where am I?

The first command you should know is pwd, which prints your current working directory, that is, where you currently are in the filesystem. For example, if I open a terminal window in my “Downloads” directory, type pwd, and hit ‘Enter’, it will return something like:

/home/gennaro/Downloads

indicating that I am in my “Downloads” directory. It can be informative to take a look at how the Unix filesystem works if you are unfamiliar with looking at files on the command line. Basically, all files on a Unix system begin at /, which represents the ‘root’ of the filesystem. All files and directories are accessed by traversing this tree to its tips, which represent files or directories.

Listing files

Now that I’m in my “Downloads” directory I want to see what files I’ve downloaded. To do this, I can use the ls command to ‘list’ files in the directory.

ls

which returns:

BDNF-data.tsv  CORI_Candidate_SNP_draft_250528_clean.docx  differential-expression2.tsv

Your “Downloads” directory will of course have different files. If I need to display more information about these files, such as the time that they were last modified or how large they are, I can supply the ls command with arguments. For example:

ls -lah

Returns

total 15M
drwxr-xr-x  2 gennaro gennaro 4.0K Jun  1 14:52 .
drwxr-x--- 51 gennaro gennaro 4.0K Jun  1 09:34 ..
-rw-rw-r--  1 gennaro gennaro 6.8K May 30 18:00 BDNF-data.tsv
-rw-rw-r--  1 gennaro gennaro 973K May 30 17:34 CORI_Candidate_SNP_draft_250528_clean.docx
-rw-rw-r--  1 gennaro gennaro  14M May 30 12:54 differential-expression2.tsv

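This provides information about the file permissions, the owner and group, the file sizes, and when the files were last modified. The flags combine: -l gives the long listing, -a includes hidden entries such as . and .., and -h prints human-readable sizes.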

Learning more about a command

To learn more about what arguments are available to any of the command line programs you run, you can use the man, or manual, command. This command will open the user manual for the given command.

Try typing man ls to view all of the options available when listing files with ls. Use the ‘q’ key to exit the manual information.

Moving around

Let’s say I want to move from my “Downloads” directory to my “Documents” directory. The command to use is cd, short for “change directory”, followed by the directory we want to go to. For example, to move to my “Documents” directory I can use cd with the full path to that directory. Since typing entire paths quickly becomes cumbersome, most shells use the “tab” key to autocomplete paths. So if you start typing ls Do and hit “tab” in your home directory, you should see “Documents” and “Downloads” appear as options that can be autocompleted.

cd /home/gennaro/Documents

Directory shortcuts

The shell has a few shortcuts that make moving around a little easier.

  • cd: Running cd alone without any arguments will bring you back to your home directory
  • cd ~: the ~ character is a shortcut for your home directory (for example, /home/gennaro), so you can use it in place of the full path. For example, to change to the Downloads directory: cd ~/Downloads
  • cd ..: Will allow you to move “up” the directory tree. So to move from “/home/gennaro/Downloads” back up to “/home/gennaro” I can run cd .. (two dots).
  • cd -: Will allow you to return to the last directory that you were in. For example, if I was in my “Documents” directory, changed to “Downloads” and then wanted to get back into “Documents” I could simply use cd -.

There are many more shortcuts you can learn as you become more fluent on the command line.
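
The shortcuts above can be tried safely in a throwaway directory. This is only a sketch: the playground variable and the directory names inside it are made up for illustration, and mktemp -d is assumed to be available (it is on most Linux and macOS systems).

```shell
# Build a throwaway directory tree so nothing in your real home is touched.
playground=$(mktemp -d)
mkdir -p "$playground/Documents" "$playground/Downloads"

cd "$playground/Documents"
cd ../Downloads       # ".." is the parent: go up one level, then into Downloads
pwd                   # path ending in /Downloads
cd - > /dev/null      # jump back to the previous directory (Documents)
pwd                   # path ending in /Documents
cd ..                 # up one level, into the playground directory itself
pwd
```

cd - normally prints the directory it switches to, which is why its output is sent to /dev/null here.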

Creating files

One way to create files is with the touch command. For example, to create an empty file in my “Downloads” directory called “A.txt” we can run:

touch ~/Downloads/A.txt

Redirection

Another, more common way to add contents to a file is redirection. Redirection lets you send the output of a program into a file instead of the terminal (or read a program’s input from a file). It is closely related to Unix pipes, which connect the output of one program to the input of another.

In this example, I’ll take the output from echo and redirect it to the file “A.txt” that we just created. echo simply prints its arguments back out to the terminal using what is called “stdout” or “standard out”. If we instead want to ‘redirect’ this output (stdout) to a file we can use the >> operator which will append the output from the echo command to the file.

echo "This is a new line in the file" >> ~/Downloads/A.txt
echo "Here is another new line in the file" >> ~/Downloads/A.txt

The >> took the output of the echo command and inserted it as a new line in “A.txt”. Importantly, >> appended these lines into “A.txt”. If I were instead to use > like:

echo "This will replace the current contents of A.txt" > ~/Downloads/A.txt

“A.txt” will be overwritten with the new contents. > is more common to see in practice. In bioinformatics you’ll often see commands that take the output of a program and redirect the results to a new file with >.

The final essential operator is the pipe |. The pipe lets you take the output (stdout) from one program and use it as input (stdin - “standard in”) to another. I’ll show an example of this later. The pipe operator technically handles streams differently from the other redirection operators but for our purposes imagine it as moving data from one program into the next.

There are other redirection operators that you’ll come across, for example for handling stdout and stderr (“standard error”, where programs write error messages) as separate streams. These are not essential to know right now but are useful to have a solid grasp of as you advance.
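
As a rough illustration of the operators discussed in this section, the sketch below writes to files in a temporary directory. The workdir variable and the file names are hypothetical; 2> (redirect stderr) and < (read stdin from a file) are standard shell operators that the text above does not cover in detail.

```shell
workdir=$(mktemp -d)

echo "first line"  > "$workdir/notes.txt"    # > creates or overwrites the file
echo "second line" >> "$workdir/notes.txt"   # >> appends to it

# 2> redirects stderr. This listing fails on purpose, so the error message
# lands in errors.txt instead of on the screen; "|| true" ignores the failure.
ls "$workdir/no-such-file" 2> "$workdir/errors.txt" || true

# < feeds a file to a command's stdin; this counts the lines in notes.txt
wc -l < "$workdir/notes.txt"
```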

Displaying the content of files

The simplest way to display the contents of a file on the command line is by using the cat command. The cat command is actually designed to concatenate files together, but running it on a single file will print the entire contents of the file to the command line (AKA stdout). For example, to print the contents of “A.txt” in your terminal:

cat ~/Downloads/A.txt

Will print

This will replace the current contents of A.txt

to the console. If you have a lot of text to display, cat can put too much information on the screen at once. Instead, you can use the less command, which shows the contents of the file one screen at a time. Use the space bar to scroll down a full page and the b key to scroll back up (d and u scroll half a page down and up), and press q to quit.

Another way to display only some of the contents of a file is to use the head or tail commands. head -n10 will print the first 10 lines of a file, whereas tail -n10 can be used to print the last 10 lines of a file.

cat is also useful for actually combining the contents of files as well. So something like cat A.txt B.txt > C.txt can be used to easily create a new file called “C.txt” with the combined contents of “A.txt” and “B.txt”.
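
A small sketch of head, tail, and cat on files where the result is easy to predict. The temporary directory and file names are made up for illustration, and seq is assumed to be installed (it ships with most Linux and macOS systems).

```shell
workdir=$(mktemp -d)

seq 1 100 > "$workdir/numbers.txt"    # a 100-line file containing 1, 2, ..., 100
head -n 3 "$workdir/numbers.txt"      # prints 1, 2 and 3
tail -n 2 "$workdir/numbers.txt"      # prints 99 and 100

echo "from A" > "$workdir/A.txt"
echo "from B" > "$workdir/B.txt"
cat "$workdir/A.txt" "$workdir/B.txt" > "$workdir/C.txt"
cat "$workdir/C.txt"                  # prints "from A" followed by "from B"
```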

Copying files

You can copy a file using the cp command. For example, to copy the “A.txt” file into a new file “B.txt” I can use

cp ~/Downloads/A.txt ~/Downloads/B.txt

To copy an entire directory you need to supply the -r, or recursive argument to the cp command. For example, to create a copy of my “Downloads” directory inside of my “Documents” directory

cp -r ~/Downloads ~/Documents/Downloads-copy

Moving and renaming files

The mv or “move” command can be used to move files and rename them. For example, to move the “A.txt” file into my “Documents” directory I can use:

mv ~/Downloads/A.txt ~/Documents

If I want to change the name of that file I can also use the mv command. Now you need to specify the new file name instead of the location to move the file to

mv ~/Documents/A.txt ~/Documents/C.txt
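
To see the difference between copying, moving, and renaming without touching real files, the same steps can be replayed in a scratch directory. The scratch variable and its “Downloads” and “Documents” subdirectories are stand-ins created only for this sketch.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/Downloads" "$scratch/Documents"
echo "hello" > "$scratch/Downloads/A.txt"

cp "$scratch/Downloads/A.txt" "$scratch/Downloads/B.txt"   # A.txt and B.txt now both exist
mv "$scratch/Downloads/A.txt" "$scratch/Documents"         # A.txt now lives only in Documents
mv "$scratch/Documents/A.txt" "$scratch/Documents/C.txt"   # same directory, new name

ls "$scratch/Downloads" "$scratch/Documents"
```

After running this, the stand-in “Downloads” holds only B.txt and the stand-in “Documents” holds only C.txt.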

Making new directories

To make a new directory you can use the mkdir command. To make a new directory inside of my “Downloads” directory I can use

mkdir ~/Downloads/textfiles

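By default, the mkdir command fails if the parent of the directory you are creating does not exist yet, so it cannot create nested directories in one step. To create any missing parent directories as well, use the -p flag. For example, I can create a parent folder and a subfolder at the same time using: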

mkdir -p ~/Downloads/imagefiles/jpegs
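
The difference between the two forms can be checked in a scratch directory. This is a minimal sketch; the scratch variable and the imagefiles/jpegs path inside it are hypothetical.

```shell
scratch=$(mktemp -d)

# Plain mkdir fails because the parent "imagefiles" does not exist yet.
# The error message is hidden with 2> and replaced by a friendlier note.
mkdir "$scratch/imagefiles/jpegs" 2> /dev/null || echo "plain mkdir failed: parent directory is missing"

# With -p the missing parent is created along the way.
mkdir -p "$scratch/imagefiles/jpegs"
ls "$scratch/imagefiles"      # prints jpegs
```

mkdir -p is also safe to run again on a directory that already exists, which makes it convenient in scripts.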

Shell expansion

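Another useful trick is shell expansion, where the shell ‘expands’ a compact pattern into a longer list of arguments before running the command. The {a,b,c} form used here is called brace expansion, and it can be a shortcut when creating new project directories. For example: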

mkdir -p data doc scripts results/{figures,data-files,rds-files}

The results/{figures,data-files,rds-files} expands this command into

mkdir -p data doc scripts results/figures results/data-files results/rds-files

Which saves some typing. Shell expansion can also be used in other contexts. For example, I can create 260 empty text files (26 letters times 10 numbers) using the following command, which illustrates the .. range expansion.

touch ~/Downloads/textfiles/{A..Z}{1..10}.txt

Another useful shell expansion is *, which matches any part of a file name. For example, if I needed to display the contents of each of the files we just created I could run

cat ~/Downloads/textfiles/*.txt

The * symbol is just one example of a “wildcard” character. These wildcards will come in handy as you advance in your shell usage. They allow you to select patterns to match rather than having to exhaustively list all of the files you may need.
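
A handy way to learn what an expansion will produce is to put echo in front of it, since echo simply prints the arguments the shell hands to it. The sketch below assumes Bash or Zsh (a plain POSIX sh may print the braces literally), and the scratch directory and file names are made up for illustration.

```shell
# Preview a brace expansion without creating anything
echo mkdir -p results/{figures,data-files,rds-files}
# prints: mkdir -p results/figures results/data-files results/rds-files

echo {A..C}{1..2}.txt
# prints: A1.txt A2.txt B1.txt B2.txt C1.txt C2.txt

# Preview a wildcard: only names ending in .txt are matched
scratch=$(mktemp -d)
touch "$scratch/a.txt" "$scratch/b.txt" "$scratch/notes.md"
echo "$scratch"/*.txt
```

Note that the wildcard stays outside the quotes. Inside quotes, * is treated as a literal character and is not expanded.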

Removing files

Removing files on the command line can be done with the rm command. Unlike when using a GUI, when you remove files on the command line you cannot get them back so use rm wisely. To remove one of the empty files that we just created we can use:

rm ~/Downloads/textfiles/A1.txt

If I want to remove the entire “textfiles” directory I need to use the -r, or recursive flag with rm.

rm -r ~/Downloads/textfiles

Be careful when using rm. A simple space can mean removing entire file systems by mistake!
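
One cautious habit, sketched below under the assumption that you are deleting by wildcard, is to preview the pattern with ls first and only then hand the same pattern to rm. The scratch directory and file names are hypothetical.

```shell
scratch=$(mktemp -d)
touch "$scratch/A1.txt" "$scratch/A2.txt" "$scratch/keep.csv"

ls "$scratch"/*.txt     # preview: lists A1.txt and A2.txt, nothing is deleted yet
rm "$scratch"/*.txt     # same pattern, now actually removing the files
ls "$scratch"           # prints keep.csv
```

rm also accepts -i, which asks for confirmation before each removal when you are working interactively.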

Finding files

One incredibly useful but often overlooked command line tool is find. find does exactly what you expect it to do: it finds files and folders. find has many arguments, but the simplest usage is finding files whose names match a specific pattern. For example, to find all text (.txt) files in the current directory and all of its subdirectories you can use:

find . -name "*.txt" -type f

This command says, “starting from the current directory (.), find any file (-type f) whose name ends in ‘.txt’”. find is especially powerful when combined with the -exec argument, which runs a command on each match. For example, to remove all .txt files in a directory and its subdirectories you can use:

find . -name "*.txt" -type f -exec rm {} \;

# Alternatively, depending on your system
find . -name "*.txt" -type f -delete
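
Because -exec rm deletes without asking, it can help to run the same find expression once without -exec as a dry run. A minimal sketch in a scratch directory (the directory layout and file names are made up):

```shell
scratch=$(mktemp -d)
mkdir -p "$scratch/sub"
touch "$scratch/a.txt" "$scratch/sub/b.txt" "$scratch/sub/c.csv"

# Dry run: list what matches before deleting anything
find "$scratch" -name "*.txt" -type f

# Same expression, now removing each match
find "$scratch" -name "*.txt" -type f -exec rm {} \;

# Only sub/c.csv is left
find "$scratch" -type f
```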

Downloading files

curl and wget are both command line utilities for downloading files from remote resources. curl will download and stream the results to your terminal by default. wget will save the result to a file by default.

Try running the following to download The Complete Works of William Shakespeare from Project Gutenberg:

# Prints the contents to stdout
curl https://www.gutenberg.org/cache/epub/100/pg100.txt

# Saves the contents into pg100.txt
wget https://www.gutenberg.org/cache/epub/100/pg100.txt

Searching the contents of files

grep is a tool that’s used to search the contents of files for specific text patterns. For example, if you wanted to find every line in a text file that contains the text “the” (including inside longer words such as “other”) you could use:

grep "the" pg100.txt

grep also has many useful arguments. One of the most useful is -c, which makes grep return the number of matching lines instead of the lines themselves. For example, to count the number of lines in a text file that contain “the”:

grep -c "the" pg100.txt

grep is one of those commands that you’ll use a ton and has a bunch of options for returning the resulting lines. Mastering grep will make your life easier.
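
A few of those options, sketched on a tiny made-up file so the counts are easy to verify by eye. -c is described above; -w (match whole words only) and -i (ignore case) are additional flags supported by GNU and BSD grep.

```shell
sample=$(mktemp)
printf 'the cat\nanother dog\nThe end\nno match here\n' > "$sample"

grep -c "the" "$sample"       # prints 2: "the cat" and "another dog" (inside "another")
grep -c -w "the" "$sample"    # prints 1: only "the cat" has "the" as a whole word
grep -c -i -w "the" "$sample" # prints 2: ignoring case also matches "The end"
```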

awk

AWK is another command line tool that is also a programming language. It is used quite extensively in bioinformatics for matching patterns, filtering text, and extracting lines from delimited files. Because awk is a full programming language, learning it is outside the scope of this tutorial. Tools like bioawk have even been developed specifically with bioinformatics file formats in mind.
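
Just to give a flavor of what awk looks like, here is a minimal sketch that filters a tab-delimited table. The table, its column names, and the 0.05 cutoff are all invented for illustration.

```shell
table=$(mktemp)
printf 'gene\tlog2fc\tpvalue\nBDNF\t2.1\t0.001\nACTB\t0.1\t0.8\nTP53\t-1.7\t0.01\n' > "$table"

# -F '\t' splits each line on tabs. NR > 1 skips the header line.
# $3 is the third column (pvalue) and $1 is the first (gene).
significant=$(awk -F '\t' 'NR > 1 && $3 < 0.05 { print $1 }' "$table")
echo "$significant"     # prints BDNF and TP53, one per line
```

The general shape of an awk program is a condition followed by an action in braces, applied to every line of the input.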

Replacing file contents

sed is a command line tool for finding and replacing text in a file. By default sed prints the edited text to stdout and leaves the file untouched. sed also has arguments that allow you to replace file contents ‘in-place’, meaning the file itself gets changed permanently. This can be convenient for very large files where you don’t want to keep a second copy, but it can also be really dangerous. The basic syntax of a sed command is:

sed 's/find/replace/' myfile.txt

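By default this replaces only the first match on each line; adding the g (“global”) flag after the last slash replaces every match. So if you wanted to replace all of the instances of ‘the’ in the pg100.txt file with the word “ELEPHANT” you could use: sed 's/the/ELEPHANT/g' pg100.txt > shakespeare-elephants.txt.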
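
The sketch below shows, on a one-line made-up file, how a substitution behaves with and without the g flag, and one cautious way to edit in-place. The -i.bak form (in-place edit that keeps a backup with the given suffix) is used here because it is accepted by both GNU and BSD sed.

```shell
sample=$(mktemp)
printf 'the cat and the hat\n' > "$sample"

sed 's/the/ELEPHANT/' "$sample"     # prints: ELEPHANT cat and the hat
sed 's/the/ELEPHANT/g' "$sample"    # prints: ELEPHANT cat and ELEPHANT hat

# In-place edit; the original contents are kept in "$sample.bak"
sed -i.bak 's/the/ELEPHANT/g' "$sample"
cat "$sample"                       # prints: ELEPHANT cat and ELEPHANT hat
cat "$sample.bak"                   # prints: the cat and the hat
```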

Piping commands

The Unix pipe is what makes the command line so powerful. You can string together small programs to build up solutions to complex problems. The pipe allows you to take the output from one program and use it as input to another program directly.

For example, suppose we wanted to count the top 10 most frequently used words across all of the works of Shakespeare. The commands below are pretty complicated but illustrate how multiple single-purpose tools can interact with each other through the pipe operator |. This modular workflow is a core tenet of the Unix philosophy.

curl https://www.gutenberg.org/cache/epub/100/pg100.txt | \
sed 's/[^a-zA-Z ]/ /g' | \
tr 'A-Z ' 'a-z\n' | \
grep '[a-z]' | \
sort | \
uniq -c | \
sort -nr -k1 | \
head -n10

  • curl downloads the text file from Project Gutenberg and streams it to stdout
  • sed replaces all characters that are not spaces or letters, with spaces.
  • tr changes all of the uppercase letters into lowercase and converts the spaces in the lines of text to newlines (each ‘word’ is now on a separate line)
  • grep includes only lines that contain at least one lowercase alphabetical character (removing any blank lines)
  • sort sorts the list of ‘words’ into alphabetical order
  • uniq counts the occurrences of each word
  • sort sorts the occurrences numerically in descending order
  • head shows the top 10 lines
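
To watch the same steps work without downloading anything, the pipeline can be fed a single made-up sentence. This is only a sketch: printf stands in for curl, and head keeps the top 2 words instead of the top 10.

```shell
top_words=$(
  printf 'The cat saw the dog. The dog ran!\n' |
    sed 's/[^a-zA-Z ]/ /g' |
    tr 'A-Z ' 'a-z\n' |
    grep '[a-z]' |
    sort |
    uniq -c |
    sort -nr -k1 |
    head -n2
)
# Each line is a count followed by a word: 3 for "the", then 2 for "dog"
echo "$top_words"
```

A useful way to understand a long pipeline is to build it up one stage at a time, running it after each added command to inspect the intermediate output.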

Compressing and uncompressing

Bioinformatics and command line tools can generally work with compressed data. Data compression saves space which can be really beneficial when transferring files over the internet. gzip is an old but commonly used compression utility that is compatible with many command line utilities. To compress a file for example:

gzip pg100.txt

Will produce a compressed version of the “pg100.txt” file called “pg100.txt.gz”. Many Unix tools can work directly with gzipped files. For example:

zcat pg100.txt.gz

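Will decompress the file and print its contents to the terminal, leaving the .gz file unchanged. (On some macOS versions zcat expects files ending in .Z; gzcat or gunzip -c do the same job there.) Similarly, zgrep can be used to search the contents of a gzipped file directly, without first decompressing it to disk: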

zgrep -c "the" pg100.txt.gz

To decompress a file you use the ‘un’ version of the compression command, gunzip.

gunzip somefile.txt.gz

There are other compression formats to be aware of in bioinformatics. The most important is “blocked gzip” (BGZF), which is produced by the bgzip tool. BGZF files are compatible with gzip but store the data in independently compressed ‘blocks’. This block structure makes it possible to build indexes of large files, which let you jump to specific records in massive files without having to read through the entire file from the top down.
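
A quick round trip with gzip can be sketched on a small made-up file. gzip -c and gunzip -c write to stdout instead of replacing the input file, which keeps the original in place while you experiment.

```shell
workdir=$(mktemp -d)
printf 'line one\nthe second line\n' > "$workdir/notes.txt"

# Compress to a new file and keep the original
gzip -c "$workdir/notes.txt" > "$workdir/notes.txt.gz"

# Decompress to stdout and look at the first line only
gunzip -c "$workdir/notes.txt.gz" | head -n 1     # prints: line one

# Search the compressed file directly
zgrep -c "the" "$workdir/notes.txt.gz"            # prints: 1
```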

For-loops

The command line is also a scripting language and, like any scripting language, it provides some basic control flow utilities. One of the more useful of these is the basic for loop. In Bash, the for-loop takes the form of a for-each loop. The looping variable can be referred to inside the loop by prefixing its name with $.

For example, to loop through all text files (assumed here to contain one word per line) and record the most frequent word from each:

# For each file that ends in .txt:
for F in *.txt
do
  # Sort the lines of the file ("$F" is the loop variable, quoted in case the
  # file name contains spaces), count unique lines, sort the counts numerically,
  # and append the most frequent line to the 'top-words.txt' file
  sort "$F" | uniq -c | sort -nr -k1 | head -n1 >> top-words.txt
done
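
The same loop structure can be tried on a couple of tiny made-up files. In this sketch the output of the whole loop is redirected once, after done, into a file whose name does not match *.txt so that a later run of the loop would not pick up its own output.

```shell
workdir=$(mktemp -d)
printf 'a\nb\n' > "$workdir/one.txt"
printf 'c\n'    > "$workdir/two.txt"

for F in "$workdir"/*.txt
do
  # Count the lines in the current file; tr strips padding that some wc versions add
  n=$(wc -l < "$F" | tr -d ' ')
  echo "$(basename "$F") has $n line(s)"
done > "$workdir/line-counts.log"

cat "$workdir/line-counts.log"
# prints:
# one.txt has 2 line(s)
# two.txt has 1 line(s)
```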

GNU parallel

GNU parallel is a command line tool that can replace many for-loops. parallel is extremely powerful and feature filled. Importantly, it lets you run commands as multiple jobs at the same time. For example, instead of writing a for loop we can process the text files above using 8 jobs at once by passing -j 8 to parallel (without -j, parallel runs roughly one job per CPU core):

parallel -j 8 "sort {} | uniq -c | sort -nr -k1 | head -n1 >> top-words.txt" ::: *.txt

The GNU parallel cheatsheet gives a good overview of the capabilities of parallel and how you can use it practically.

Editors

You’ll eventually need to edit some code or files from the terminal. Two options for code editing from the command line are vim and nano. vim can be more difficult to use for a beginner but is very powerful.

vim

Once you’re in vim you can use the i key to enter “insert” mode. Insert mode lets you type in new characters. Once you’ve typed away, save your work and exit by pressing the Esc key to return to normal mode, then type :wq, and finally press Enter. This series of keystrokes can be difficult to get used to, which leads to the most famous of StackOverflow questions: how to exit vim.

nano, however, provides a more user friendly interface.

nano

Connecting to a remote server

Most data processing tasks in bioinformatics require running commands on high-performance compute clusters with lots of cores and RAM. You’ll typically access these servers using ssh, or secure shell. We use ssh because the connection is encrypted and an ssh client is available on virtually every Unix-like system.

To initialize an ssh connection we use the syntax

ssh username@my.remote.server.org

If you’re accessing the same remote resource often, it’s advisable to do two things: (1) add the host to your ssh config file and (2) create an ssh key. The ssh config file describes default settings for connecting to a remote resource. Use one of the text editors above to create or edit the file located at ~/.ssh/config. An entry specifies a short alias for the host, its host name or IP address, the user that is accessing the server, and the port:

Host remote_server
  HostName 203.0.113.44
  User myusername
  Port 22

Once you save this file you can access the remote server using ssh remote_server. However, you’ll notice that every time you do this you’ll need to enter your password. Instead of entering your password every time, a more secure and convenient way is to create an ssh key-pair and add it to the remote server. To create an ssh key you can use the ssh-keygen program.

ssh-keygen

This will ask where to save the key and prompt you for an optional passphrase to protect it, then print the key’s fingerprint and random art. Once the private key and public key have been created, you can use the ssh-copy-id command to copy the public key to the remote server. This will ask you for your password on the remote machine. After you enter your password correctly, it will copy the public key to the remote server.

ssh-copy-id remote_server

After these steps you will be able to use ssh remote_server and seamlessly login to the remote server without having to enter IP addresses or passwords.

Long running jobs

If you do log onto a remote server, chances are you’ll have to run a job that takes a long time, longer than you’re willing to sit at your computer for. To make sure that your long running command stays running when you log out of the remote server, the easiest way is to use a terminal multiplexer like tmux.

Running tmux new -s long-session will create a new session named “long-session” where you can start a long running command. Once this command is running you can ‘detach’ from the session safely without worrying about the process quitting when you log off of the remote server. To detach from a session, press Ctrl + b and then d while in the active session. When you log back onto the remote server you can reattach to the running session with tmux a (short for tmux attach), or with tmux attach -t long-session to pick that session by name.

Resources