In this tutorial, you will learn everything you need to know about storage on the RC cluster.

1 - Your Home Directory

Your Home Directory (or homedir) is what you see when you log in: /home/<your_username>. You can store datasets, code, and any other files that your collaborators don’t need access to in your home directory. If you have a shared directory for your research project, store your code, data, and other project files there instead.

These files are only accessible by you. DO NOT change the permissions on your home directory to allow others to access your files. If you need shared storage, you can request it using the form linked below.

1.1 - Your Home Directory Is For

Data that is for you alone, such as:

  • SSH keys
  • Custom software environments (e.g. conda)
  • Configuration files (e.g. .bashrc)
  • Folders that ondemand.rc.rit.edu creates

1.2 - Storage Quotas

By default, your homedir has a 256GB storage quota (the example below shows a researcher with an increased 1TB quota). You can see how much storage you have left by running the following:

$ df -h ~
Filesystem                                 Size  Used Avail Use% Mounted on
<sanitized_IP_addresses>:/home/<username>  1.0T  722G  278G  36% /home/<username>

Note: ~ is shorthand for your home directory. ~ is equivalent to /home/<username>.

You can ignore all those IP addresses at the beginning. What you care about are the Size, Used, and Avail fields:

  • Size tells you what your quota is set to. In this example, 1TB.
  • Used tells you how much of your quota has been used. In this example, 722GB.
  • Avail tells you how much of your quota is left. In this example, 278GB.

1.3 - Home Directory Cleanup

If your quota is almost full, start by checking how full your trash folder is:

$ ls -lah /home/<username>/.local/share/Trash       
total 0
drwx------  5 <username> student    0 Jul 31  2023 .
drwx------ 18 <username> student 124G Apr 10 16:44 ..
drwx------  2 <username> student    0 Jul 31  2023 expunged
drwx------  2 <username> student 124G Jul 31  2023 files
drwx------  2 <username> student    0 Jul 31  2023 info

In this example, the researcher has 124G of deleted files in their trash folder. After verifying that there is nothing you care about in your trash folder, you can empty it with the following command:

$ rm -rf /home/<username>/.local/share/Trash

After emptying your trash folder, check for any large directories in your homedir:

$ ls -lah /home/<username>
total 752M
drwx------ 22 <username> root    1.4T Apr 11 03:25 .
drwxr-xr-x 41 root       root       0 Apr 11 09:37 ..
drwxr-xr-x  3 <username> student  31G Sep 18  2021 3840
drwxr-xr-x  3 <username> student  38G Feb  5 22:32 API_Algo_Selection
drwxr-xr-x  5 <username> student 5.3G Oct 23  2021 API_Migration
-rw-r--r--  1 <username> student 476M Sep 19  2021 API_Migration.tar.gz
-rw-r--r--  1 <username> student 271M Sep 27  2021 API Migration Transfer.tar.gz
drwxr-xr-x  3 <username> student  32G Sep 18  2021 back
-rw-------  1 <username> student  25K Apr 11 06:36 .bash_history
-rw-r--r--  1 <username> root     385 May 12  2020 .bash_profile
-rw-r--r--  1 <username> root     461 May 12  2020 .bashrc
drwx------  5 <username> student  61M Feb  5 22:47 .cache
drwx------  4 <username> student  679 Feb  5 22:47 .config
drwxr-xr-x  2 <username> student    0 Mar 11 13:34 .dirnews
-rw-r--r--  1 <username> student 1.5M Feb 16  2021 file_history_1.txt
-rw-r--r--  1 <username> student 1.5M Feb 16  2021 file_history.txt
-rw-r--r--  1 <username> student 8.3K Feb  1  2021 holdingpattern.txt
drwxr-xr-x  3 <username> student  29K Nov 17 15:59 .ipython
drwxr-xr-x  2 <username> student  120 Feb  7 14:11 .keras
drwxr-xr-x  5 <username> student 141M Nov 17 15:58 .local
drwxr-xr-x  3 <username> student 102G Sep 15  2021 Lp Minimization
drwx------  3 <username> student  48M Mar 12 01:21 .nv
-rw-r--r--  1 <username> student 2.5M Mar 26  2023 pending.txt
drwxr-----  3 <username> student    0 Feb  4  2022 .pki
-rw-------  1 <username> student  44K Mar 29 13:26 .python_history
drwxr-xr-x  3 <username> student 135G Sep 18  2021 R-CASS-ICWS-6144
drwxr-xr-x  8 <username> student 192G Sep 18  2021 R-CASS_Test_Experiments
-rw-------  1 <username> student  263 Nov 17 15:59 remotemachine.json
drwxr-xr-x  2 <username> student 9.9K Sep 18  2021 slurm_examples
drwxr-xr-x  4 <username> student  15M Apr 24  2023 .spack
drwx------  2 <username> student 4.7K Feb 26  2021 .ssh
drwxr-xr-x  6 <username> student 822G Mar 26  2023 Transfer_Learning
drwxr-xr-x  2 <username> student  289 Feb 12 19:17 .vim
-rw-------  1 <username> student  14K Apr 11 03:25 .viminfo

In this example, most of the researcher’s storage is in the Transfer_Learning directory. You may have similar large directories. Check them for forgotten files that you no longer need (e.g. checkpoints, .zip and .tar.gz archives).
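If the ls output is hard to scan, you can rank the items in your homedir by total size with du. This is a generic sketch (the flags below assume the GNU userland found on most Linux clusters), and it walks every file, so it may take a while on a large homedir:

```shell
# -s: report one summary line per argument; -h: human-readable sizes.
# The second glob picks up hidden files/directories (like .cache).
# sort -rh orders human-readable sizes largest-first.
du -sh ~/* ~/.[!.]* 2>/dev/null | sort -rh | head -n 10
```

The largest directories appear at the top, so you immediately know where to focus your cleanup.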

If, after cleaning up, you still need more space, please ask your Advisor/PI to request more storage using this form. This request must come from Advisors/PIs, not from student researchers.

1.4 - What Happens When I Graduate?

When you graduate, you will no longer have access to the cluster, which includes any files in your home directory. Please make plans to copy any files/folders you would like to keep. Your home directory will be deleted after a set period of time.

1.5 - What if I Need to Keep Using the Cluster After I Graduate?

If you need to continue using the cluster after you graduate (for example, if reviewers ask for revisions/additional experiments), an RIT Faculty Researcher must request a sponsored affiliate account on your behalf using this form. It is best to make this request before (or shortly after) you graduate to ensure that your home directory is not deleted before you need it.

2 - Shared Project Directories

You can request a shared directory to facilitate collaboration with other researchers through ColdFront. Shared directories are located in /shared/rc/ on the cluster and can be accessed by all of the researchers working on a shared project. We highly encourage you to make use of shared directories to avoid duplication of efforts and access issues post-graduation. The same commands and cleanup suggestions apply for shared directories.

Shared directories are also great for storing large datasets. One copy of a 500GB dataset is much better than three copies.

Shared directories exist for the duration of a project; they are not indefinite storage for a P.I.’s (or their lab’s) data. When a project ends, data should either be moved to a new project, or moved off of the Cluster.

2.1 - Project Directories Are For

Data that your collaborators need access to, such as:

  • Datasets for your experiments
  • Shared software environments
  • Code for running your experiments
  • Results from your experiments

3 - Shared Datasets

We maintain a collection of commonly-used datasets in /shared/rc/datasets/. Datasets stored here do not count against your home directory or project directory quotas. Any researcher can access the datasets stored in /shared/rc/datasets/. Access is Read-Only. If your dataset meets the following criteria, you can request that we keep a copy in /shared/rc/datasets/:

  • The dataset is commonly-used throughout your research domain, or multiple domains (e.g. ImageNet)
  • You only need Read access to the dataset
  • The dataset does not need to be updated more frequently than once per year

Note: Datasets in /shared/rc/datasets/ are only accessible on the cluster; this is not a place for you to store datasets you want to release as part of a publication.

4 - Transferring Files to/from the Cluster

4.1 - OnDemand Web Portal

If you need to transfer a small set of files/folders from your laptop/workstation to the cluster, you can use OnDemand to do that. After logging in, you can select Home Directory from the Files tab and then use the Upload/Download buttons.

If you need to download/upload files from Google Drive or another website, you can launch a desktop session from OnDemand and access a web browser from there.

4.2 - Command Line

If your laptop/workstation runs macOS or Linux, you can use the scp command to copy files to the cluster:

scp ~/fileDirectory/fileToCopy.py <rit_username>@sporcsubmit.rc.rit.edu:~/fileToCopy.py

To copy a whole directory to the cluster, use the -r (recursive) option:

scp -r ~/fileDirectory/ <rit_username>@sporcsubmit.rc.rit.edu:~/DirectoryName

To do the reverse and copy a file from the cluster to your computer:

scp <rit_username>@sporcsubmit.rc.rit.edu:~/DirectoryName/fileToCopy.py ~/fileToCopy.py

4.3 - Globus

Globus is an advanced tool for secure and reliable research data management. We have a separate Globus Tutorial.

4.4 - Other Options

There are other options for file transfer, but we only test and maintain OnDemand, scp, and Globus. If you choose to use a different option for file transfer, you are responsible for troubleshooting any issues that arise.

5 - Backups

Research Computing has resilient storage, meaning we can lose a few hard drives without losing any data. We do not back up your data. You are responsible for backing up any data that you care about. Two copies of the data on the cluster is not a backup.

6 - Speeding Up Your Compute

Home directories and shared directories are stored on good old hard drives, not SSDs. If your compute jobs rely on a lot of data, you might see some speedup by adjusting your workflow to use different storage.

Note: Your home directory is located on a networked filesystem. Accessing your homedir from cluster nodes takes slightly longer than if you were just accessing files on your laptop. Most of the time, you won’t even notice this slowdown, but if you are working with lots of files or doing I/O intensive compute, storing your data in your homedir might be a bottleneck. In those cases, using scratch space is recommended.

6.1 - Scratch Space

Scratch space is temporary storage for files. If you have transient data or need lower latency, then you may see some speedup using scratch space. Note: Scratch is also located on a networked filesystem.

/scratch is a cluster file system (NVMe) accessible from every node on the cluster.

You can modify your workflow in your sbatch scripts to access files from /scratch.

We recommend creating a directory in /scratch to store your files in, e.g. mkdir /scratch/<username>. If you do that, then your sbatch scripts should look something like this:

#SBATCH configurations here

spack env activate <environment_name>

# Create your scratch directory (-p: no error if it already exists)
mkdir -p /scratch/<username>
cp -r /home/<username>/<path_to_data> /scratch/<username>/

# Your code here

# Copy your results back, then clean up after yourself
mv /scratch/<username>/<path_to_results> /home/<username>/
rm -rf /scratch/<username>

Scratch space is deleted periodically, so make sure you copy what you need and clean up after yourself.

6.2 - Temporary Storage Within Jobs

Inside your jobs, you have access to a temporary /tmp directory. This /tmp directory allows for faster access to files than if you were using /scratch or your homedir.

You can modify your workflow in your sbatch scripts to copy the files you need to /tmp before your computation, then move your files back to your home directory (or shared directory) after your computation finishes. If you are using /tmp, you must move your files out of /tmp as part of your job, since /tmp is only accessible from inside the job while it is running.

Your sbatch scripts would look something like this:

#SBATCH configurations here

spack env activate <environment_name>

cp /home/<username>/<path_to_data> /tmp/

# Your code here

mv /tmp/<path_to_results> /home/<username>/

6.3 - What if Scratch Space Isn’t Speeding Up My Workflow?

Recently, we have seen many researchers storing their data across thousands of tiny files. This is inefficient and can be a major bottleneck, especially when working with GPUs. Reducing the number of files that contain your data can substantially improve performance. There are many libraries that can help you store your data more efficiently, such as HDF5, PyTables, and memory-mapped NumPy files.

For example, let’s say you have a dataset that is spread across 12,000 files. Behind the scenes, your code is opening, reading, (maybe) writing, and closing every single one of those files. That’s a lot of overhead slowing you down. We tested with a toy example, and it took about 1 minute and 17 seconds to read all of that data from disk. We used HDF5 to store the whole dataset in one file, and now we can read the data in just under 3 seconds! Here’s a comparison:

12,000 Files:
real    1m16.916s
user    0m2.091s
sys     0m3.329s

1 HDF5 File:
real    0m2.962s (~96.5% speedup)
user    0m2.146s
sys     0m2.240s

1 Memory Mapped Numpy File:
real    0m0.246s (~99.7% speedup)
user    0m1.627s
sys     0m2.099s

1 PyTables File:
real    0m0.941s (~98.8% speedup)
user    0m1.656s
sys     0m2.219s

This is just a small example. Imagine the performance improvement if your dataset is 100,000 files, or 1,000,000 files!
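To make the consolidation idea concrete, here is a minimal sketch of the memory-mapped NumPy approach. The file names, array shapes, and dataset contents are all made up for illustration; this is not our benchmark code:

```python
import os
import tempfile

import numpy as np

# Hypothetical setup: simulate a dataset stored as many tiny per-sample
# files (a small stand-in for the 12,000-file case above).
tmpdir = tempfile.mkdtemp()
n_samples, sample_len = 100, 32
for i in range(n_samples):
    np.save(os.path.join(tmpdir, f"sample_{i}.npy"),
            np.full(sample_len, i, dtype=np.float32))

# Consolidate: one pass over the tiny files into a single on-disk array.
merged_path = os.path.join(tmpdir, "dataset.dat")
merged = np.memmap(merged_path, dtype=np.float32, mode="w+",
                   shape=(n_samples, sample_len))
for i in range(n_samples):
    merged[i] = np.load(os.path.join(tmpdir, f"sample_{i}.npy"))
merged.flush()

# Later (e.g. inside a training job), reopen the single file read-only.
# Rows are paged in on demand instead of opening thousands of files.
data = np.memmap(merged_path, dtype=np.float32, mode="r",
                 shape=(n_samples, sample_len))
print(data[42, 0])  # prints 42.0
```

After the one-time consolidation, every job reads from a single file, which removes the per-file open/close overhead entirely. HDF5 (via h5py) and PyTables follow the same pattern with richer features like compression and named datasets.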

Note: If you see that your GPU utilization is low (e.g. 2%) and choppy, your bottleneck is likely lots of tiny files!



Tags: instructions