$title = "Batch job submission - dnqs"; $area = "Unix Support"; $metadata = "unix, jobs, commands, dnqs, batch, queue, qsub, qstat, qrm, qdate, qexample, quser, qacct"; $pfloc = ""; require '/usr/local/wwwdocs/ucs/fragments/header.phtml'; require '/usr/local/wwwdocs/ucs/fragments/header-bc.phtml'; ?>
Many Unix jobs are not interactive in nature and require prolonged periods of computation. It is anti-social to run jobs of this type on a workstation, monopolising its interactive facilities; instead they should be submitted to the Distributed Network Queuing System (dnqs), where they will run in a controlled sequence alongside other batch jobs, unattended by the user.
dnqs consists of a set of queues distinguished by the various requirements of the jobs submitted to them, e.g. machine, execution time. These queues dispatch jobs in turn to the time-shared computers (currently Aidan and Finan) and to a set of other machines which are dedicated to this purpose. Aidan and Finan may run a set of batch jobs concurrently, as well as servicing many other interactive activities. The dedicated machines run only one job at a time, that job having exclusive use of the host machine.
Note that dnqs jobs may be submitted from any ISS Solaris system.
In order to use dnqs a job must first be constructed: this is simply a set of Unix commands (just as would be typed if the work were being done interactively at a workstation), assembled in the correct sequence in a file. However, it should be recognised that the execution of batch jobs lacks the human supervision present in interactive work: if jobs are to run successfully, possible pitfalls must be anticipated and provision made for them, since errors cannot be fixed "on the fly".
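As a concrete illustration, a job file might be assembled like this (the project directory, program and data file names are invented for the sketch; big_calc is the job file used in the submission example later on):

```shell
# Assemble a job file called big_calc. The heredoc delimiter is quoted
# so that $HOME is expanded when the job runs, not when the file is made.
cat > big_calc <<'EOF'
cd $HOME/project
./simulate < input.dat > results.out 2> errors.log
EOF
```

Each line of the file is executed in turn when the job runs, exactly as if typed at a workstation; note that output and errors are directed to named files rather than left to the screen.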
File names
In particular, jobs using the shared temporary file directories /tmp, /usr/local/tmp1
and /usr/local/tmp2 need to be constructed in such a manner that
they do not attempt to create file names which may already have been used
by other users. (Using your own login name as part of such filenames is
a good way to do this.)
It should also be borne in mind that if several similar jobs are to be submitted to the batch system, precautions should be taken to avoid them clashing in their use of file names, since it is possible that several may execute concurrently. Again, the use of the temporary directories is susceptible to errors of this nature.
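One defensive sketch combining both precautions (the file name and contents are invented): the login name keeps different users apart, and the shell's process id ($$) keeps concurrent jobs of the same user apart.

```shell
#!/bin/sh
# Build a temporary file name unique to this user AND this job:
# $LOGNAME distinguishes users, $$ distinguishes concurrent jobs.
SCRATCH=/tmp/${LOGNAME:-$USER}_$$_work.dat
echo "intermediate results" > "$SCRATCH"
# ... the real computation would read and write $SCRATCH here ...
```

A job should remove such files itself when it finishes (rm -f "$SCRATCH"), since the temporary directories are shared.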
Interruptions
Although modern computers are extremely reliable, ISS does not guarantee
that failures will never occur. Furthermore, from time to time facilities
have to be taken out of service for maintenance. Batch jobs should not depend
upon being able to run for the maximum time associated with their queues.
Consequently, whenever possible a job should periodically generate data which can be used to restart it without loss of the computation performed so far.
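A minimal shape for such a job (the work itself is only hinted at, and all names are invented): progress is recorded in a checkpoint file after each unit of work, and the job consults that file on start-up, so a resubmitted run continues where the interrupted one stopped.

```shell
#!/bin/sh
CKPT=checkpoint.dat
# Resume from the checkpoint if a previous run left one behind.
if [ -f "$CKPT" ]; then
    START=`cat "$CKPT"`
else
    START=1
fi
STEP=$START
while [ "$STEP" -le 5 ]; do
    # ... one restartable unit of real work goes here ...
    echo "$STEP" > "$CKPT"    # record progress after each unit
    STEP=`expr $STEP + 1`
done
```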
Search path
In interactive operation, at login there is normally an automatic initialisation
process which sets up the search path and possibly other conditions. When
a batch job commences, this does not occur and it may be necessary to include
in the job file similar initialisation operations. Alternatively, in the
absence of an appropriate search path, full path names may be specified.
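Both approaches can be seen in a small sketch (the path shown is a typical default, not a guarantee of what dnqs provides):

```shell
#!/bin/sh
# 1. Set up a search path explicitly at the top of the job file,
#    since no login initialisation has run for a batch job.
PATH=/usr/local/bin:/usr/bin:/bin
export PATH
# 2. Or avoid relying on the path at all by giving full path names.
/bin/date > run_started.log
```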
Queues have been set up for jobs which require "small", "medium", "large" and "extra large" amounts of processing time (for example sunp_s, sunp_m, sunp_l and sunp_xl).
Queues available on aidan have names prefixed sunp.
Queues available on the Sun host (v880) dedicated to running DNQS-initiated jobs are prefixed sun750m.
Note that the queues impose a limit on per-process CPU time: this limit may change, and additional limits may be introduced at short notice. The number and locations of the queues are also subject to change in the light of experience.
| Queue name | Max CPU time | Max VM (Megabytes) | Machine |
|---|---|---|---|
| sunp_s | 3 hours | 250 | aidan (400MHz) |
| sunp_m | 12 hours | 250 | aidan (400MHz) |
| sunp_l | 24 hours | 250 | aidan (400MHz) |
| sunp_xl | 5 days | 250 | aidan (400MHz) |
| sun750m_m | 24 hours | 600 | batch1 (750MHz) |
| sun750m_l | 7 days | 600 | batch1 (750MHz) |
| sun750m_xl | 28 days | 600 | batch1 (750MHz) |
The above details are subject to change, due to periodic system upgrades.
ISS must be consulted with respect to jobs requiring more VM than stated above.
The amount of processing time a job requires depends on the speed of the computer on which it is run. Apart from the fact that the Ultra 5 dnqs hosts are currently limited with respect to virtual memory (for system performance reasons), it is difficult to give proper guidance as to which queue is the most appropriate.
It may be useful to collect timing figures for scaled-down trial versions of jobs in order to estimate how much CPU time the real jobs would require on each host. Having done that, it is a matter of selecting the queue which permits that amount of CPU time and which will enable completion within an acceptable elapsed time. Of course, elapsed time depends on how many other users and jobs the computer is servicing at the same time, and it is therefore still impossible to estimate accurately on the time-sharing machines Aidan and Finan.
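Once a trial figure is in hand, the scaling is simple arithmetic; for instance (all figures invented), if a 1/100-scale trial used 130 CPU seconds:

```shell
#!/bin/sh
# Scale a trial timing up to a full-job CPU estimate.
TRIAL_SECONDS=130   # CPU time used by the 1/100-scale trial
SCALE=100           # the trial was 1/100 of the real job
ESTIMATE=`expr $TRIAL_SECONDS \* $SCALE`
echo "estimated CPU need: $ESTIMATE seconds"
```

Here the estimate is 13000 seconds, about 3.6 hours: beyond the 3-hour limit of sunp_s, so sunp_m would be the natural choice on aidan.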
The actual submission of the job is done using the qsub command, specifying the selected queue and the name of the file containing the job, e.g.
qsub sun750m_m big_calc
This puts the job (contained here in a file called big_calc) into the selected queue (here, sun750m_m) where it will await its turn for execution.
A unique number is assigned to the job; this can be used to investigate its progress and to identify its output, e.g.
Your job "big_calc" (935678577) has been submitted to queue sun750m_m.
A number of options are available for the qsub command, and the Unix man command describes these: type
man qsub
Of particular interest is -M which causes the user to be e-mailed with details of the execution of the job.
Generally, when a job runs it produces output, which may be directed explicitly to files named in the job or to the standard output and error streams.
In interactive work, the standard streams are normally displayed on the screen. This is not possible for batch jobs, so their contents are collected into files kept in a directory called dnqs_outputs in the user's home directory. The names of these files contain the job number for identification, e.g.
935678577.stdout and 935678577.stderr
The directory /usr/local/dnqs/examples contains simple examples of the use of dnqs and the file /usr/local/dnqs/examples/README shows how to try out these examples.
A particular example, qexample, which has its own man page, shows a more sophisticated scheme (for optimising file activity). This should not be contemplated until you have a complete understanding of simple dnqs usage.
A number of other utility commands are available.
| Command | Purpose |
|---|---|
| qstat | show dnqs queue status, including settings and limits |
| qrm | remove a dnqs job from the queue |
| qdate | convert a job id (number) to a date/time; reports submission time, not start time |
| qacct | display accumulated records of all dnqs jobs |
| quser | display records only for the specified user |
See their man pages for their purpose and use.
dnqs was developed in the USA and modified for use at Newcastle University. Consequently it may be found that the man pages are not completely accurate; however, all locally written documentation should be authoritative.
require '/usr/local/wwwdocs/ucs/fragments/footer.phtml'; ?>