In this tutorial, you will learn everything you need to know about storage on the RC cluster.
1 - Your Home Directory
Your home directory (or homedir) is what you see when you log in: /home/<your_username>. You can store datasets, code, and any other files that your collaborators don’t need access to in your home directory. If you have a shared directory for your research project, you should store your code, data, and other files there instead.
These files are only accessible by you. DO NOT change the permissions on your home directory to allow others to access your files. If you need shared storage, you can request it using the form linked below.
1.1 - Your Home Directory Is For
Data that is for you alone, such as:
- SSH keys
- Custom software environments (e.g. conda)
- Configuration files (e.g. .bashrc)
- Folders that ondemand.rc.rit.edu creates
1.2 - Storage Quotas
By default, your homedir has a 256GB storage quota (individual quotas may be larger, as in the example below). You can see how much storage you have left by running the following:
$ df -h ~
Filesystem Size Used Avail Use% Mounted on
<sanitized_IP_addresses>:/home/<username> 1.0T 722G 278G 36% /home/<username>
Note: ~ is shorthand for your home directory. ~ is equivalent to /home/<username>.
You can ignore all those IP addresses at the beginning. What you care about are the Size, Used, and Avail fields:
- Size tells you what your quota is set to. In this example, 1TB.
- Used tells you how much of your quota has been used. In this example, 722GB.
- Avail tells you how much space is left in your quota. In this example, 278GB.
1.3 - Home Directory Cleanup
If your quota is almost full, start by checking how full your trash folder is:
$ ls -lah /home/<username>/.local/share/Trash
total 0
drwx------ 5 <username> student 0 Jul 31 2023 .
drwx------ 18 <username> student 124G Apr 10 16:44 ..
drwx------ 2 <username> student 0 Jul 31 2023 expunged
drwx------ 2 <username> student 124G Jul 31 2023 files
drwx------ 2 <username> student 0 Jul 31 2023 info
In this example, the researcher has 124G of deleted files in their trash folder. After verifying that there is nothing you care about in your trash folder, you can empty it with the following command:
$ rm -rf /home/<username>/.local/share/Trash
After emptying your trash folder, check for any large directories in your homedir:
$ ls -lah /home/<username>
total 752M
drwx------ 22 <username> root 1.4T Apr 11 03:25 .
drwxr-xr-x 41 root root 0 Apr 11 09:37 ..
drwxr-xr-x 3 <username> student 31G Sep 18 2021 3840
drwxr-xr-x 3 <username> student 38G Feb 5 22:32 API_Algo_Selection
drwxr-xr-x 5 <username> student 5.3G Oct 23 2021 API_Migration
-rw-r--r-- 1 <username> student 476M Sep 19 2021 API_Migration.tar.gz
-rw-r--r-- 1 <username> student 271M Sep 27 2021 API Migration Transfer.tar.gz
drwxr-xr-x 3 <username> student 32G Sep 18 2021 back
-rw------- 1 <username> student 25K Apr 11 06:36 .bash_history
-rw-r--r-- 1 <username> root 385 May 12 2020 .bash_profile
-rw-r--r-- 1 <username> root 461 May 12 2020 .bashrc
drwx------ 5 <username> student 61M Feb 5 22:47 .cache
drwx------ 4 <username> student 679 Feb 5 22:47 .config
drwxr-xr-x 2 <username> student 0 Mar 11 13:34 .dirnews
-rw-r--r-- 1 <username> student 1.5M Feb 16 2021 file_history_1.txt
-rw-r--r-- 1 <username> student 1.5M Feb 16 2021 file_history.txt
-rw-r--r-- 1 <username> student 8.3K Feb 1 2021 holdingpattern.txt
drwxr-xr-x 3 <username> student 29K Nov 17 15:59 .ipython
drwxr-xr-x 2 <username> student 120 Feb 7 14:11 .keras
drwxr-xr-x 5 <username> student 141M Nov 17 15:58 .local
drwxr-xr-x 3 <username> student 102G Sep 15 2021 Lp Minimization
drwx------ 3 <username> student 48M Mar 12 01:21 .nv
-rw-r--r-- 1 <username> student 2.5M Mar 26 2023 pending.txt
drwxr----- 3 <username> student 0 Feb 4 2022 .pki
-rw------- 1 <username> student 44K Mar 29 13:26 .python_history
drwxr-xr-x 3 <username> student 135G Sep 18 2021 R-CASS-ICWS-6144
drwxr-xr-x 8 <username> student 192G Sep 18 2021 R-CASS_Test_Experiments
-rw------- 1 <username> student 263 Nov 17 15:59 remotemachine.json
drwxr-xr-x 2 <username> student 9.9K Sep 18 2021 slurm_examples
drwxr-xr-x 4 <username> student 15M Apr 24 2023 .spack
drwx------ 2 <username> student 4.7K Feb 26 2021 .ssh
drwxr-xr-x 6 <username> student 822G Mar 26 2023 Transfer_Learning
drwxr-xr-x 2 <username> student 289 Feb 12 19:17 .vim
-rw------- 1 <username> student 14K Apr 11 03:25 .viminfo
In this example, most of the researcher’s storage is in the Transfer_Learning directory. You may have similar large directories. Check these directories for files that you forgot about that you no longer need (e.g. checkpoints, .zip and .tar.gz archives).
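Another way to track down space hogs is du, which totals the on-disk size of each item. Here is a small sketch using standard tools (run it against your own homedir):

```shell
# Total the size of every item in your home directory, largest first.
# The second glob picks up hidden files and folders (like .cache) as well.
du -sh ~/* ~/.[!.]* 2>/dev/null | sort -rh | head -n 15
```

sort -rh sorts the human-readable sizes in descending order, and head keeps the 15 largest entries.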
If, after cleaning up, you still need more space, please ask your Advisor/PI to request more storage using this form. This request must come from Advisors/PIs, not from student researchers.
1.4 - What Happens When I Graduate?
When you graduate, you will no longer have access to the cluster, which includes any files in your home directory. Please make plans to copy any files/folders you would like to keep. Your home directory will be deleted after a set period of time.
1.5 - What if I Need to Keep Using the Cluster After I Graduate?
If you need to continue using the cluster after you graduate (for example, if reviewers ask for revisions or additional experiments), an RIT Faculty Researcher must request a sponsored affiliate account on your behalf using this form. It is best to make this request before (or shortly after) you graduate to ensure that your home directory is not deleted before you need it.
2 - Shared Project Directories
You can request a shared directory to facilitate collaboration with other researchers through ColdFront. Shared directories are located in /shared/rc/ on the cluster and can be accessed by all of the researchers working on a shared project. We highly encourage you to make use of shared directories to avoid duplication of efforts and access issues post-graduation. The same commands and cleanup suggestions apply for shared directories.
Shared directories are also great for storing large datasets. One copy of a 500GB dataset is much better than three copies.
Shared directories exist for the duration of a project; they are not indefinite storage for a P.I.’s (or their lab’s) data. When a project ends, data should either be moved to a new project, or moved off of the Cluster.
2.1 - Project Directories Are For
Data that your collaborators need access to, such as:
- Datasets for your experiments
- Shared software environments
- Code for running your experiments
- Results from your experiments
3 - Shared Datasets
We maintain a collection of commonly-used datasets in /shared/rc/datasets/. Datasets stored here do not count against your home directory or project directory quotas. Any researcher can access the datasets stored in /shared/rc/datasets/. Access is Read-Only. If your dataset meets the following criteria, you can request that we keep a copy in /shared/rc/datasets/:
- The dataset is commonly-used throughout your research domain, or multiple domains (e.g. ImageNet)
- You only need Read access to the dataset
- The dataset does not need to be updated more frequently than once per year
Note: Datasets in /shared/rc/datasets/ are only accessible on the cluster; this is not a place for you to store datasets you want to release as part of a publication.
4 - Transferring Files to/from the Cluster
4.1 - OnDemand Web Portal
If you need to transfer a small set of files/folders from your laptop/workstation to the cluster, you can use OnDemand to do that. After logging in, you can select Home Directory from the Files tab and then use the Upload/Download buttons.
If you need to download/upload files from Google Drive or another website, you can launch a desktop session from OnDemand and access a web browser from there.
4.2 - Command Line
If your laptop/workstation runs macOS or Linux, you can use the scp command to copy files to the cluster:
scp ~/fileDirectory/fileToCopy.py <rit_username>@sporcsubmit.rc.rit.edu:~/fileToCopy.py
To copy a whole directory of work to the cluster, use the -r (recursive) option:
scp -r ~/fileDirectory/ <rit_username>@sporcsubmit.rc.rit.edu:~/DirectoryName
To do the reverse and copy a file from the cluster to your computer:
scp <rit_username>@sporcsubmit.rc.rit.edu:~/DirectoryName/fileToCopy.py ~/fileToCopy.py
4.3 - Globus
Globus is an advanced tool for secure and reliable research data management. We have a separate Globus Tutorial.
4.4 - Other Options
There are other options for file transfer, but we only test and maintain OnDemand, scp, and Globus. If you choose to use a different option for file transfer, you are responsible for troubleshooting any issues that arise.
5 - Backups
Research Computing has resilient storage, meaning we can lose a few hard drives without losing any data. However, we do not back up your data. You are responsible for backing up any data that you care about. Two copies of the data on the cluster is not a backup.
6 - Speeding Up Your Compute
Home directories and shared directories are stored on traditional hard drives, not SSDs. If your compute jobs rely on a lot of data, then you might see some speedup by adjusting your workflow to use different storage.
Note: Your home directory is located on a networked filesystem. Accessing your homedir from cluster nodes takes slightly longer than if you were just accessing files on your laptop. Most of the time, you won’t even notice this slowdown, but if you are working with lots of files or doing I/O intensive compute, storing your data in your homedir might be a bottleneck. In those cases, using scratch space is recommended.
6.1 - Scratch Space
Scratch space is temporary storage for files. If you have transient data or need lower latency, then you may see some speedup using scratch space. Note: Scratch is also located on a networked filesystem.
/scratch is a cluster file system (NVMe) accessible from every node on the cluster.
You can modify your workflow in your sbatch scripts to access files from /scratch.
We recommend creating a directory in /scratch to store your files in, e.g. mkdir /scratch/<username>. If you do that, then your sbatch scripts should look something like this:
#SBATCH configurations here
spack env activate <environment_name>
mkdir -p /scratch/<username>
cp /home/<username>/<path_to_data> /scratch/<username>/
# Your code here
mv /scratch/<username>/<path_to_results> /home/<username>/
rm -rf /scratch/<username>
Scratch space is deleted periodically, so make sure you copy what you need and clean up after yourself.
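If you run several jobs at once, they can trample each other’s files in /scratch/<username>. One common refinement is to give each job its own subdirectory using Slurm’s SLURM_JOB_ID environment variable (which Slurm sets inside every job). This is a sketch, not an official template; the paths and environment name are placeholders:

```shell
#!/bin/bash
#SBATCH configurations here

spack env activate <environment_name>

# Per-job scratch directory: concurrent jobs each get their own copy of the data
SCRATCH_DIR="/scratch/$USER/$SLURM_JOB_ID"
mkdir -p "$SCRATCH_DIR"
cp /home/<username>/<path_to_data> "$SCRATCH_DIR/"

# Your code here

mv "$SCRATCH_DIR"/<path_to_results> /home/<username>/
rm -rf "$SCRATCH_DIR"   # clean up so scratch stays usable for everyone
```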
6.2 - Temporary Storage Within Jobs
Inside your jobs, you have access to a temporary /tmp directory. This /tmp directory allows for faster access to files than if you were using /scratch or your homedir.
You can modify your workflow in your sbatch scripts to copy the files you need to /tmp before your computation, then move your files back to your home directory (or shared directory) after your computation finishes. If you are using /tmp, you need to move your files out of /tmp as part of your job, since you can only access /tmp inside your job while it is running.
Your sbatch scripts would look something like this:
#SBATCH configurations here
spack env activate <environment_name>
cp /home/<username>/<path_to_data> /tmp/
# Your code here
mv /tmp/<path_to_results> /home/<username>/
You do not need to empty /tmp yourself; Slurm deletes it when your job finishes.
6.3 - What if Scratch Space Isn’t Speeding Up My Workflow?
Recently, we have seen a lot of researchers storing their data across thousands of tiny files. This is inefficient and can be a major bottleneck, especially when working with GPUs. Reducing the number of files that contain your data can result in improved performance. There are many libraries that can help you store your data more efficiently, such as HDF5, PyTables, and memory-mapped NumPy files.
For example, let’s say you have a dataset that is spread across 12,000 files. Behind the scenes, your code is opening, reading, (maybe) writing, and closing every single one of those files. That’s a lot of overhead slowing you down. We tested with a toy example, and it took about 1 minute and 17 seconds to read all of that data from disk. We used HDF5 to store the whole dataset in one file, and now we can read the data in just under 3 seconds! Here’s a comparison:
12,000 Files:
real 1m16.916s
user 0m2.091s
sys 0m3.329s
1 HDF5 File:
real 0m2.962s (~96.5% speedup)
user 0m2.146s
sys 0m2.240s
1 Memory Mapped Numpy File:
real 0m0.246s (~99.7% speedup)
user 0m1.627s
sys 0m2.099s
1 PyTables File:
real 0m0.941s (~98.8% speedup)
user 0m1.656s
sys 0m2.219s
This is just a small example. Imagine the performance improvement if your dataset is 100,000 files, or 1,000,000 files!
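If you want to see the per-file overhead for yourself, here is a quick shell experiment (the file counts and sizes are arbitrary toy values):

```shell
# Create 2,000 tiny files plus one combined file with the same bytes,
# then time reading each. The many-files case pays an open/read/close
# round trip for every single file.
workdir=$(mktemp -d)
for i in $(seq 1 2000); do
    head -c 1024 /dev/zero > "$workdir/part_$i"
done
cat "$workdir"/part_* > "$workdir/combined"

time cat "$workdir"/part_* > /dev/null    # 2,000 opens and closes
time cat "$workdir/combined" > /dev/null  # one open and close

rm -rf "$workdir"
```

Even at this small scale, the single combined file usually reads measurably faster, and the gap grows with file count and with network filesystems.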
Note: If you see that your GPU utilization is low (e.g. 2%) and choppy, your bottleneck is likely lots of tiny files!