Using the HuBMAP HIVE Infrastructure
The HuBMAP HIVE infrastructure environment is always evolving. To build the environment you need to be successful, we want to hear from you. You can request specific software to be installed or tell us what other resources your work requires by emailing help@hubmapconsortium.org. Don't be shy! No request is too small.
This document will change and expand as the test infrastructure changes and expands. Be sure to check it frequently for updates.
In this document:
- What is the HuBMAP HIVE Infrastructure?
- Request an account
- HIVE cluster
- HIVE cluster configuration
- Accessing the HIVE cluster
- HIVE cluster file systems
- Data transfer to the HIVE cluster
- Software on the HIVE cluster
- Containers on the HIVE cluster
- Virtual machines on the HIVE cluster
- Running jobs on the HIVE cluster
- Partitions on the HIVE cluster
- Interactive mode
- Batch mode
- The sbatch command
- Options to the srun and sbatch commands
- Getting help
What is the HuBMAP HIVE Infrastructure?
The HuBMAP HIVE Infrastructure is for HuBMAP members who are creating tools for the HuBMAP project. Two resources are available: the HIVE cluster and PSC's Bridges-2 supercomputing system. Most development will be done on the HIVE cluster; we expect Bridges-2 to be used when developers want to work via the command line.
Request an account
Request an account by submitting this form: https://grants.psc.edu/cgi-bin/hubmap/add_users.pl.
Once your request is approved, you will receive an email with account information and login instructions; your account gives you access to both the HIVE cluster and Bridges-2.
The HIVE cluster
The sections below pertain to the HIVE cluster.
HIVE cluster configuration
The HIVE cluster currently includes:
- a login node
- two 60-core, 3TB RAM nodes (“CPU nodes”)
- two GPU nodes (“GPU nodes”), one with NVIDIA P100 GPUs and one with NVIDIA A100 GPUs (see the configuration table below)
- two data transfer nodes (DTNs)
The login node also hosts user-specific VMs. Do not run computational jobs on the login node - it has limited resources and is shared among all currently logged-in users.
The CPU nodes (l001 and l002) and the GPU nodes (gpu000 and a100) are available for computational jobs. All nodes and VMs run CentOS 7.6 unless otherwise specified.
The DTNs facilitate bulk data transfers to and from the cluster's filesystems. See the Data Transfer section below.
HIVE cluster compute nodes
 | CPU nodes | GPU nodes | |
---|---|---|---|
Name | l001, l002 | gpu000 | a100 |
CPUs | 4 Intel Xeon E7-4480 v2 CPUs; 15 cores/CPU; 2.50GHz | 2 Intel Broadwell CPUs; 16 cores/CPU; 2.50GHz | 64 AMD EPYC 7543 |
RAM | 3TB | 125GB | 2TB |
Cache | 38.4MB | 25.6MB | |
GPUs | N/A | 2 NVIDIA P100 GPUs; 16GB memory/GPU | 8 NVIDIA A100 GPUs; 80GB memory/GPU |
Accessing the HIVE cluster
Use SSH to connect to a login node at hive.psc.edu.
ssh hive.psc.edu
Upon successful login, a bash command-line shell session is started for you on the login node in your user account. See more on using SSH on PSC systems at https://www.psc.edu/about-using-ssh.
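If the username on your local machine differs from your HIVE username, specify it explicitly on the command line (username here is a placeholder for your HIVE account name):
ssh username@hive.psc.edu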
Do not run computational jobs on the login node. Its resources are limited and are shared among all currently logged-in users.
HIVE cluster file systems
A 450TB shared Lustre parallel file system is mounted at /hive on all nodes and VMs. The /hive directory structure currently includes:
Purpose | Path |
---|---|
User $HOME directories | /hive/users |
Software package installation directories | /hive/packages |
Software environment module directories | /hive/modulefiles |
HuBMAP Data Landing Zone directories | /hive/hubmap/lz/group, where group is one of: Broad Institute RTI California Institute of Technology TMC Cal Tech TTD General Electric RTI Harvard TTD IEC Testing Group Northwestern RTI Purdue TTD sample Stanford Backups Stanford RTI Stanford-snevins Stanford TMC Stanford TTD tedz-share testing tissue-reg-data tissue-reg-save Stanford TMC University of California San Diego TMC University of Florida sample University of Florida TMC Vanderbilt TMC |
HuBMAP Data Archive directories | /hive/hubmap/data |
Data transfer to the HIVE cluster
For small volumes of data (100MB or less), use scp or sftp to transfer data to/from the login nodes at hive.psc.edu. For larger transfers, you should use scp, sftp, or rsync to the cluster's Data Transfer Nodes (DTNs), which are available at data.hive.psc.edu. The DTNs are separate from the login nodes and mount the same file systems as the login nodes, but have higher-bandwidth network interfaces.
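For example, to copy a file or a directory tree to your home directory through a DTN (mydata.tar, myproject, and username below are placeholders):
scp mydata.tar username@data.hive.psc.edu:~/
rsync -avz myproject/ username@data.hive.psc.edu:~/myproject/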
Software on the HIVE cluster
Some common software packages have been installed on the HIVE cluster for your use.
The Environment Modules package is essential for running software on the HIVE cluster. The module file for a given package defines the paths and environment variables needed to use that package.
The list of installed software that you can add to your shell environment is available by typing:
module avail
To load the environment for a software package, type:
module load package_name
where package_name is the name of an available software package. Several versions of software packages may be available; if you require a specific version other than the default one, be sure to specify it by adding the version number after the package_name, e.g.,
module load foo/2.1
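For example, a typical session might look like the following, where foo is a placeholder package name; module list shows the modules currently loaded in your environment, and module unload removes one:
module avail
module load foo/2.1
module list
module unload foo/2.1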
If you would like additional software to be installed, please email help@hubmapconsortium.org.
Containers on the HIVE cluster
The HIVE infrastructure supports Singularity containers. To add the singularity commands to your shell environment or to your SLURM job script, use the command:
module load singularity
To use Docker containers, you will need to import them into Singularity format. Please consult the Singularity documentation at https://sylabs.io/docs/ for details - the user documentation includes a section dedicated to Singularity and Docker.
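For example, singularity pull can build a Singularity image directly from an image on Docker Hub. The image name below is only an illustration, and the name of the resulting image file may differ depending on the Singularity version installed:
module load singularity
singularity pull docker://ubuntu:20.04
singularity exec ubuntu_20.04.sif cat /etc/os-release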
Virtual Machines on the HIVE cluster
If you need a VM, one can be set up for you. To request a VM, send email to help@hubmapconsortium.org. Include this information:
- Name
- Username on hive.psc.edu
- Phone
- Purpose of VM
- Software needed
- Storage requirements
- How many cores are needed
- How much RAM is needed
- When you need the VM to be available (start date)
- How long do you need the VM to remain in operation (end date)
- How many of these identical VMs do you need
- Do you need a network port opened
- Any other information needed to prepare the VM
Running jobs on the HIVE cluster
The SLURM scheduler manages the batch jobs run on the HIVE cluster. From the login node you can use SLURM scheduler commands (e.g., sbatch, salloc, srun) to submit jobs or to initiate tasks on compute nodes.
You can run in interactive mode or batch mode.
Partitions on the HIVE cluster
Two partitions (queues) are set up on the HIVE cluster for jobs to run in: the batch partition, which manages the CPU nodes and is the default, and the GPU partition, for jobs using the GPU nodes. The -p option to the srun and sbatch commands indicates which partition you want to use.
Interactive mode
Use the srun command to run interactively. Using options to the srun command, you must specify the type and number of nodes that you need, a time limit, and the shell to use. For interactive jobs, you must always use the --pty option to srun.
The format for an srun command is:
srun --pty --time=hh:mm:ss --nodes=nnodes --ntasks-per-node=ncpus shell-name
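For example, to start an interactive bash session on one CPU node with 4 cores for 30 minutes:
srun --pty --time=00:30:00 --nodes=1 --ntasks-per-node=4 bash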
To use any of the GPU nodes, you also need to specify the GPU partition, the type of GPU node to use, and the number of GPUs you need. For example, an srun command to use one CPU and one GPU of the A100 GPU node for one hour would be:
srun --pty --time=01:00:00 --nodes=1 --ntasks-per-node=1 -p GPU --gpus=a100:1 bash
See the table of srun and sbatch options for more information.
Batch mode
To run a batch job, you must first create a batch script, and then submit the script to a partition (queue) using the sbatch command. When the job completes, output is written by default to a file named slurm-jobid.out in the directory from which the job was submitted.
A batch script is a file that consists of SBATCH directives, executable commands and comments.
SBATCH directives specify your resource requests and other job options in your batch script. The SBATCH directives must start with '#SBATCH' as the first text on a line, with no leading spaces.
Comments begin with a '#' character.
The first line of any batch script must indicate the shell to use for your batch job, using the format:
#!/bin/shell-name
For example:
#!/bin/bash
You can also specify resource requests and options on the sbatch command line. Any options given on the command line take precedence over those given in the batch script.
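As a minimal sketch, a batch script that requests one CPU node with 8 cores for two hours might look like the following; the module name, program name, and output file name are placeholders:
#!/bin/bash
#SBATCH -t 02:00:00
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH -o myjob.out
# Load the software the job needs
module load foo/2.1
# Run the program
./my_program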
The sbatch command
To submit a batch job, use the sbatch command. The format is:
sbatch -options batch-script
The options to sbatch can be given either in your batch script or on the sbatch command line. Options on the command line override those in the batch script. See the table below for many of the common options.
SLURM will return a message indicating that your job has been submitted successfully and identifying its job id:
Submitted batch job nnn
Example sbatch commands
To run on one of the GPU nodes, use an sbatch command of the form:
sbatch --time=walltime --nodes=nnodes --ntasks-per-node=ncpus -p GPU --gpus=gpu_node_type:ngpus batch-script
This sbatch command submits a job to run the script batch.script on the A100 GPU node, using one CPU and one GPU, for one hour:
sbatch --time=01:00:00 --nodes=1 --ntasks-per-node=1 -p GPU --gpus=a100:1 batch.script
This command submits a job to run the script batch.script on the P100 GPU node, using one CPU and one GPU, for one hour:
sbatch --time=01:00:00 --nodes=1 --ntasks-per-node=1 -p GPU --gpus=P100:1 batch.script
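To run on one of the CPU nodes instead, omit the GPU options; the job will then run in the default batch partition. For example, this command submits batch.script to run with 16 cores on one CPU node for four hours:
sbatch --time=04:00:00 --nodes=1 --ntasks-per-node=16 batch.script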
Options to the srun and sbatch commands
Common options to the srun and sbatch commands are listed below.
Option | Description |
---|---|
-t HH:MM:SS | Walltime requested |
-N n | Number of nodes requested |
--ntasks-per-node=n Note the "--" for this option | Number of cores to allocate per node |
--gpus=gpu_node_type:ngpus Note the "--" for this option | Type and number of GPUs to allocate, where gpu_node_type can be "a100" or "P100" and ngpus is the number of GPUs to allocate. |
--mem=nGB Note the "--" for this option | Amount of memory requested in GB. The default is 51.2GB per core requested; see the --ntasks-per-node option to request cores. |
-p GPU | Dictates that the job will run in the GPU partition, on a GPU node. Without this option, the job will run on one of the CPU nodes in the batch partition. |
--pty Note the "--" for this option | Must be used for all interactive jobs |
-o filename | Standard output and error are written to filename. The default filename is slurm-jobid.out. |
--mail-type=type Note the "--" for this option | Send email when job events occur, where type can be BEGIN, END, FAIL or ALL |
--mail-user=user Note the "--" for this option | User to send email to, as specified by --mail-type. The default is the user who submits the job. |
-d=dependency-list | Set up dependency lists between jobs, where dependency-list can be: after:job_id[:jobid...] This job can begin execution after the specified jobs have begun execution. afterany:job_id[:jobid...] This job can begin execution after the specified jobs have terminated. aftercorr:job_id[:jobid...] A task of this job array can begin execution after the corresponding task ID in the specified job has completed successfully (ran to completion with an exit code of zero). afterok:job_id[:jobid...] This job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero). afternotok:job_id[:jobid...] This job can begin execution after the specified jobs have terminated in some failed state (non-zero exit code, node failure, timed out, etc) singleton This job can begin execution after any previously launched jobs sharing the same job name and user have terminated. |
--no-requeue Note the "--" for this option | Specifies that your job will not be requeued under any circumstances. If your job is running on a node that fails, it will not be restarted. |
--time-min=HH:MM:SS Note the "--" for this option | Specifies a minimum walltime for your job in HH:MM:SS format. SLURM considers the walltime requested when deciding which job to start next. Free slots on the machine are defined by the number of nodes and how long those nodes are free until they will be needed by another job. By specifying a minimum walltime you allow the scheduler to reduce your walltime request to your specified minimum time when deciding whether to schedule your job. This could allow your job to start sooner. If you use this option your actual walltime assignment can vary between your minimum time and the time you specified with the -t option. If your job hits its actual walltime limit, it will be killed. When you use this option you should checkpoint your job frequently to save the results obtained to that point. |
-h | Help: lists all the available command options |
Getting help
Please report any issues or direct any questions to help@hubmapconsortium.org.
Return to Member Portal Home.