$title = "Batch job submission - dnqs"; $area = "Unix Support"; $metadata = "unix, jobs, commands, dnqs, batch, queue, qsub, qstat, qrm, qdate, qexample, quser, qacct"; $pfloc = ""; require '/usr/local/wwwdocs/ucs/fragments/header.phtml'; require '/usr/local/wwwdocs/ucs/fragments/header-bc.phtml'; ?>
Many Unix jobs are not interactive in nature and require prolonged periods of computation. It is anti-social to run jobs of this type on a workstation, monopolising its interactive facilities; instead they should be submitted to the Distributed Network Queuing System (dnqs), where they will run in a controlled sequence alongside other batch jobs, unattended by the user.
dnqs consists of a set of queues distinguished by the various requirements of the jobs submitted to them, e.g. machine, execution time. These queues dispatch jobs in turn to the time-shared computers (currently Aidan and Finan) and to a set of other machines which are dedicated to this purpose. Aidan and Finan may run a set of batch jobs concurrently, as well as servicing many other interactive activities. The dedicated machines run only one job at a time, that job having exclusive use of the host machine.
Note that dnqs jobs may be submitted from any ISS Solaris system.
In order to use dnqs a job must first be constructed: this is simply a set of Unix commands (just as would be typed if the work were being done interactively at a workstation), assembled in the correct sequence in a file. However, it should be recognised that the execution of batch jobs lacks the human supervision present in interactive work: if jobs are to run successfully, possible pitfalls must be anticipated and provision made for them, since errors cannot be fixed "on the fly".
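As a concrete illustration, a job file might be assembled like this (the project directory, program and data file names are invented for the sketch; big_calc is the job file used in the submission example later on):

```shell
# Assemble a job file called big_calc. The heredoc delimiter is quoted
# so that $HOME is expanded when the job runs, not when the file is made.
cat > big_calc <<'EOF'
cd $HOME/project
./simulate < input.dat > results.out 2> errors.log
EOF
```

Each line of the file is executed in turn when the job runs, exactly as if typed at a workstation; note that output and errors are directed to named files rather than left to the screen.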
File names
In particular, jobs using the shared temporary file directories /tmp, /usr/local/tmp1
and /usr/local/tmp2 need to be constructed in such a manner that
they do not attempt to create file names which may already have been used
by other users. (Using your own login name as part of such filenames is
a good way to do this.)
It should also be borne in mind that if several similar jobs are to be submitted to the batch system, precautions should be taken to avoid them clashing in their use of file names, since it is possible that several may execute concurrently. Again, the use of the temporary directories is susceptible to errors of this nature.
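One defensive sketch combining both precautions (the file name and contents are invented): the login name keeps different users apart, and the shell's process id ($$) keeps concurrent jobs of the same user apart.

```shell
#!/bin/sh
# Build a temporary file name unique to this user AND this job:
# $LOGNAME distinguishes users, $$ distinguishes concurrent jobs.
SCRATCH=/tmp/${LOGNAME:-$USER}_$$_work.dat
echo "intermediate results" > "$SCRATCH"
# ... the real computation would read and write $SCRATCH here ...
```

A job should remove such files itself when it finishes (rm -f "$SCRATCH"), since the temporary directories are shared.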
Interruptions
Although modern computers are extremely reliable, ISS does not guarantee
that failures will never occur. Furthermore, from time to time facilities
have to be taken out of service for maintenance. Batch jobs should not depend
upon being able to run for the maximum time associated with their queues.
Consequently, whenever possible a job should periodically generate data which can be used to restart it without loss of the computation performed so far.
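A minimal shape for such a job (the work itself is only hinted at, and all names are invented): progress is recorded in a checkpoint file after each unit of work, and the job consults that file on start-up, so a resubmitted run continues where the interrupted one stopped.

```shell
#!/bin/sh
CKPT=checkpoint.dat
# Resume from the checkpoint if a previous run left one behind.
if [ -f "$CKPT" ]; then
    START=`cat "$CKPT"`
else
    START=1
fi
STEP=$START
while [ "$STEP" -le 5 ]; do
    # ... one restartable unit of real work goes here ...
    echo "$STEP" > "$CKPT"    # record progress after each unit
    STEP=`expr $STEP + 1`
done
```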
Search path
In interactive operation, at login there is normally an automatic initialisation
process which sets up the search path and possibly other conditions. When
a batch job commences, this does not occur and it may be necessary to include
in the job file similar initialisation operations. Alternatively, in the
absence of an appropriate search path, full path names may be specified.
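Both approaches can be seen in a small sketch (the path shown is a typical default, not a guarantee of what dnqs provides):

```shell
#!/bin/sh
# 1. Set up a search path explicitly at the top of the job file,
#    since no login initialisation has run for a batch job.
PATH=/usr/local/bin:/usr/bin:/bin
export PATH
# 2. Or avoid relying on the path at all by giving full path names.
/bin/date > run_started.log
```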
Queues have been set up for jobs which require "small", "medium", "large" and "extra large" amounts of processing time (for example sunp_s, sunp_m, sunp_l and sunp_xl).
Queues available on aidan have names prefixed sunp.
Queues available on the Sun host (v880) dedicated to running DNQS-initiated jobs are prefixed sun750m.
Note that the queues impose a limit on per-process CPU time: this limit may change, and additional limits may be introduced at short notice. The number and locations of the queues are also subject to change in the light of experience.
| Queue name | Max CPU time | Max VM (Megabytes) | Machine |
|---|---|---|---|
| sunp_s | 3 hours | 250 | aidan (400MHz) |
| sunp_m | 12 hours | 250 | aidan (400MHz) |
| sunp_l | 24 hours | 250 | aidan (400MHz) |
| sunp_xl | 5 days | 250 | aidan (400MHz) |
| sun750m_m | 24 hours | 600 | batch1 (750MHz) |
| sun750m_l | 7 days | 600 | batch1 (750MHz) |
| sun750m_xl | 28 days | 600 | batch1 (750MHz) |
The above details are subject to change, due to periodic system upgrades.
ISS must be consulted with respect to jobs requiring more VM than stated above.
The amount of processing time a job requires depends on the speed of the computer on which it is run. Apart from the fact that the Ultra 5 dnqs hosts are currently limited with respect to virtual memory (for system performance reasons), it is difficult to give proper guidance as to which queue is the most appropriate.
It may be useful to collect timing figures for scaled-down trial versions of jobs in order to estimate how much CPU time the real jobs would require on each host. Having done that, it is a matter of selecting the queue which permits that amount of CPU time and which will enable completion within an acceptable elapsed time. Of course, elapsed time depends on how many other users and jobs the computer is servicing at the same time, and it is therefore still impossible to estimate accurately on the time-sharing machines Aidan and Finan.
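Once a trial figure is in hand, the scaling is simple arithmetic; for instance (all figures invented), if a 1/100-scale trial used 130 CPU seconds:

```shell
#!/bin/sh
# Scale a trial timing up to a full-job CPU estimate.
TRIAL_SECONDS=130   # CPU time used by the 1/100-scale trial
SCALE=100           # the trial was 1/100 of the real job
ESTIMATE=`expr $TRIAL_SECONDS \* $SCALE`
echo "estimated CPU need: $ESTIMATE seconds"
```

Here the estimate is 13000 seconds, about 3.6 hours: beyond the 3-hour limit of sunp_s, so sunp_m would be the natural choice on aidan.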
The actual submission of the job is done using the qsub command, specifying the selected queue and the name of the file containing the job, e.g.
qsub sun750m_m big_calc
This puts the job (contained here in a file called big_calc) into the selected queue (here, sun750m_m) where it will await its turn for execution.
A unique number is assigned to the job; this can be used to investigate its progress and to identify its output, e.g.
Your job "big_calc" (935678577) has been submitted to queue sun750m_m.
A number of options are available for the qsub command, and the Unix man command describes these: type
man qsub
Of particular interest is -M which causes the user to be e-mailed with details of the execution of the job.
Generally, when a job runs it produces output, which may be directed explicitly to files named in the job or to the standard output and error streams.
In interactive work, the standard streams are normally displayed on the screen. This is not possible for batch jobs, so their contents are collected into files kept in a directory called dnqs_outputs in the user's home directory. The names of these files contain the job number for identification, e.g.
935678577.stdout and 935678577.stderr
The directory /usr/local/dnqs/examples contains simple examples of the use of dnqs and the file /usr/local/dnqs/examples/README shows how to try out these examples.
A particular example, qexample, which has its own man page, shows a more sophisticated scheme (for optimising file activity). This should not be contemplated until you have a complete understanding of simple dnqs usage.
A number of other utility commands are available.
| Command | Purpose |
|---|---|
| qstat | show dnqs queue status, including settings and limits |
| qrm | remove a dnqs job from the queue |
| qdate | convert a job id (number) to a date/time; reports submission time, not start time |
| qacct | display accumulated records of all dnqs jobs |
| quser | display records only for the specified user |
See their man pages for their purpose and use.
dnqs was developed in the USA and modified for use at Newcastle University. Consequently it may be found that the man pages are not completely accurate; however, all locally written documentation should be authoritative.
require '/usr/local/wwwdocs/ucs/fragments/footer.phtml'; ?>