Starting a job in bwVisu

Cancelling a job is relatively straightforward: the middleware simply invokes scancel <jobid>, after which the watcher detects that the job is no longer running and removes it from the job list in the database. Starting a job, by contrast, is considerably more involved, and understanding the middleware requires a detailed look at how it works. Starting a job involves the following steps:

  1. The web frontend submits a request to start a job via the middleware REST API. The request includes the following (see the detailed information on the API):
    • The name of the user that owns the job
    • The name of the batch script template. The middleware will attempt to load this file from the $INSTALL_DIR/templates directory. A batch script template is a template for generating a submission script for the HPC scheduler. Templates use the Mustache templating format and can therefore contain variables (for example, the name of the Singularity image to launch) that the middleware substitutes in step 3.
    • A dictionary of variables and their values that the middleware should substitute into the batch script template. Notable variables include the name of the Singularity image, the number of nodes, and the expected runtime of the job.
  2. An authentication token is generated for the job and deposited in the middleware's database. This authentication token is used later on to ensure that only jobs that have been submitted through the bwVisu web frontend are allowed to request port forwarding rules.
  3. The middleware turns the batch script template into an actual batch script by
    • Substituting all variables in the template with the values provided by the web frontend. Because these values enter a shell script, they are first sanitized against characters that could be used to inject additional commands (more information).
    • Expanding lines of the form #KEYHOLE-REQUEST-PORTS N into code that requests iptables forwarding of N ports from the gateway to the compute node. Such a line may occur at most once in a batch script template. At its location, the middleware embeds the code of the Python keyhole client script as a bash heredoc. When the batch script is executed, this client script is extracted and invoked, and it then communicates with the keyhole service to negotiate the forwarding of the ports. Port forwarding can only be requested once the job is running, because until then it is not known which nodes the HPC scheduler will assign to the job. After this line, the environment variables KEYHOLE_PORT<I>, KEYHOLE_GATEWAY<I>, and KEYHOLE_EXTERNAL_PORT<I> are defined in the script, where <I> is the index of the port.
    • Embedding the previously mentioned authentication token in the bash script, because the environment in which the keyhole client is executed must contain it.
  4. The job is submitted to the HPC scheduler. This is done by piping the generated bash script directly through ssh into sbatch, to avoid batch scripts containing authentication tokens being stored persistently on the filesystem. The middleware parses the sbatch output to extract the job id and stores it in the database for internal use.
  5. The request has finished at this point, and the job id is returned to the web frontend. The returned job id is, however, not the Slurm job id but a bwVisu job id (which the middleware can translate back into the Slurm job id if necessary, using the id stored in the database). The two ids must be kept distinct because, unless the cluster is dedicated exclusively to bwVisu, not all jobs that Slurm manages are bwVisu jobs. The middleware therefore must enumerate its jobs independently of Slurm.
  6. Although the request has finished and the web frontend can present to the user that the job was submitted, the job is not yet running. Assume that at some point the HPC scheduler decides to execute the job and executes the batch script. At this point, the keyhole client will be executed and contact the middleware to request forwarding of ports. It will submit the hostnames of the nodes in this request, which requires parsing the SLURM_NODELIST environment variable.
  7. The keyhole service performs an HMAC authentication of the request and compares the authentication token supplied in the request with the token stored for this job in the database. To prevent DoS attacks, each job is allowed to request port forwarding only once, and the maximum number of forwarded ports per job is limited.
  8. The middleware now tries to find suitable ports that are available for forwarding. For this purpose, it checks the database for entries of already assigned ports. Once it has found a port that appears unused, it verifies that no service is actually listening on that port by invoking nc -z $HOST $PORT via ssh from a gateway node. The middleware attempts to find ports that are unused on all involved hosts, so that the application can rely on being able to use the same port everywhere, e.g. for all MPI processes on multiple nodes.
  9. iptables is invoked on a gateway node to establish the port forwarding. The database is updated to mark the port as used and the forwarding rule as active.
  10. The port and gateway IP are then returned to the keyhole client executed by the batch script, and the keyhole client makes this information available as environment variables for the remainder of the batch script.
  11. The batch script continues executing and the actual application starts.
  12. At some point later, the keyhole watcher will detect, again using nc -z, that an assigned port of the job is now open and the application is listening for connections. The keyhole watcher marks the job as 'available', and upon the next query the middleware informs the web frontend that the user can now connect to the application using the assigned port and gateway IP.
  13. The web frontend shows the user that the job is available and displays information on how to connect.
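
To make step 3 more concrete, the following is a minimal sketch of how a template could be turned into a batch script. It is not the middleware's actual code. The {{name}} regex is a simplified stand-in for a full Mustache engine, and the allow-list in sanitize, the KEYHOLE_TOKEN variable name, the temporary file, and the eval-based invocation of the client are all assumptions made for illustration.

```python
import re

# Assumption: values may only contain characters from this allow-list.
_SAFE_VALUE = re.compile(r"[A-Za-z0-9_./:=@-]*")
# Simplified stand-in for Mustache: only plain {{name}} variables.
_VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")
_DIRECTIVE = re.compile(r"^#KEYHOLE-REQUEST-PORTS[ \t]+(\d+)[ \t]*$", re.MULTILINE)


def sanitize(value):
    """Reject values containing characters outside the allow-list."""
    text = str(value)
    if not _SAFE_VALUE.fullmatch(text):
        raise ValueError(f"unsafe template value: {text!r}")
    return text


def render_batch_script(template, variables, client_source, token):
    """Substitute variables, then expand the port request directive."""
    if len(_DIRECTIVE.findall(template)) > 1:
        raise ValueError("#KEYHOLE-REQUEST-PORTS may occur at most once")

    # A missing variable raises KeyError instead of rendering silently.
    script = _VARIABLE.sub(lambda m: sanitize(variables[m.group(1)]), template)

    def expand(match):
        num_ports = int(match.group(1))
        return "\n".join([
            # Hypothetical variable name carrying the authentication token.
            f"export KEYHOLE_TOKEN='{sanitize(token)}'",
            "KEYHOLE_CLIENT=$(mktemp)",
            # Quoted heredoc delimiter: bash does not expand the client code.
            "cat > \"$KEYHOLE_CLIENT\" <<'KEYHOLE_CLIENT_EOF'",
            client_source.rstrip("\n"),
            "KEYHOLE_CLIENT_EOF",
            # Assumption: the client prints export statements for KEYHOLE_* variables.
            f'eval "$(python3 "$KEYHOLE_CLIENT" {num_ports})"',
        ])

    return _DIRECTIVE.sub(expand, script)
```

Rejecting unsafe values outright, rather than trying to escape them, keeps the sketch simple; the actual sanitization rules of the middleware may differ.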
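
Step 4 can be sketched as follows. The function names and the login_host parameter are hypothetical. The sketch relies on two facts about Slurm: sbatch reads the batch script from standard input when no script file is given, and by default it reports success with a line of the form "Submitted batch job <id>".

```python
import re
import subprocess

_SBATCH_OUTPUT = re.compile(r"Submitted batch job (\d+)")


def parse_sbatch_output(output):
    """Extract the Slurm job id from the text printed by sbatch."""
    match = _SBATCH_OUTPUT.search(output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {output!r}")
    return int(match.group(1))


def submit(script, login_host):
    """Pipe the script through ssh into sbatch; nothing is written to disk."""
    result = subprocess.run(
        ["ssh", login_host, "sbatch"],
        input=script,          # the script travels via stdin only
        capture_output=True,
        text=True,
        check=True,            # raise if ssh or sbatch fails
    )
    return parse_sbatch_output(result.stdout)
```

The returned Slurm job id would then be stored in the database next to the separately generated bwVisu job id described in step 5.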
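
The checks in step 7 could look roughly like the sketch below. The message layout, the choice of SHA-256, the job dictionary, and the MAX_PORTS_PER_JOB limit are assumptions for illustration; the source only states that an HMAC authentication against the stored token takes place and that requests are limited to one per job and to a maximum number of ports.

```python
import hashlib
import hmac

MAX_PORTS_PER_JOB = 4  # assumption: the real limit is not stated


def sign_request(token: bytes, message: bytes) -> str:
    """Client side: compute the HMAC of the request under the job's token."""
    return hmac.new(token, message, hashlib.sha256).hexdigest()


def authorize(job, message, signature, num_ports):
    """Service side: accept at most one valid, bounded request per job."""
    if job["ports_requested"]:
        return False  # each job may request forwarding only once
    if not 1 <= num_ports <= MAX_PORTS_PER_JOB:
        return False
    expected = sign_request(job["token"], message)
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(expected, signature):
        return False
    job["ports_requested"] = True
    return True
```

Because only the middleware embeds the token into the batch script, a request that verifies under the stored token must originate from a job submitted through bwVisu.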
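
The port search of step 8 can be illustrated with the following sketch. The function names, the candidate port range, and the representation of the database as a set of assigned ports are hypothetical. The probe is passed in as a function so that the selection logic can be exercised without ssh access.

```python
import subprocess


def port_is_open(gateway, host, port):
    """Probe host:port from the gateway; nc -z exits with 0 if it can connect."""
    result = subprocess.run(
        ["ssh", gateway, "nc", "-z", host, str(port)],
        capture_output=True,
    )
    return result.returncode == 0


def find_free_ports(count, hosts, assigned, candidates, is_open):
    """Pick `count` ports that are unassigned and closed on every host."""
    chosen = []
    for port in candidates:
        if port in assigned:
            continue  # already recorded as in use in the database
        if any(is_open(host, port) for host in hosts):
            continue  # some service is listening on at least one host
        chosen.append(port)
        if len(chosen) == count:
            return chosen
    raise RuntimeError("not enough free ports available")
```

A call such as find_free_ports(2, hosts, assigned, range(5000, 5100), lambda h, p: port_is_open("gateway", h, p)) would combine both parts, where "gateway" and the port range are placeholders.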