
Hilda cluster

Ganglia monitor

Router administration (internet) or LAN

Description: a simple computing cluster located at the Astronomical Institute MFF UK, devoted to the dynamics of small bodies in the Solar System (see, e.g., Yarko-site). Operating system: Linux (Debian Lenny); 11 nodes with 4-core processors (Intel Core Quad 2.8 GHz); total power approx. 120 GHz (11 × 4 × 2.8 GHz = 123.2 GHz); disk capacity 11×750 GB = 8.25 TB; price approx. 175 kKc.

Software: ssh, mc, gcc, gfortran, fort77, gnuplot, imagemagick, qiv, firefox, mplayer/mencoder, ...

Usage: the cluster is accessible by ssh hilda.troja.mff.cuni.cz. The individual nodes are hilda1, hilda2, hilda3, hilda4, hilda5, hilda6, hilda7, hilda8, hilda9, hilda10, hilda11; you can use ssh to log in to any of them without a password. The NFS-shared home directory is hilda1:/home. The following scripts are provided to ease job management:

hildalist [NODES]
List all running jobs on hilda cluster nodes.
hildarun NODE DIRECTORY "JOB < STDIN >> STDOUT"
Run a job on a node.
hildastop [NODES]
Stop all running jobs of the current user.
hildacont [NODES]
Continue all stopped jobs.
hildakill [NODES]
Kill all running jobs.
hildarenice NICE [USER]
Renice all running jobs.
hildastop1 NODE PID
Stop a single job on a given node.
hildacont1 NODE PID
Continue a single job.
hildakill1 NODE PID
Kill a single job.
hildarenice1 NODE PID NICE
Renice a single job.
swiftsplit DIRECTORY NTP DRIVER
Split a swift_* job into several pieces to be run on many processors. See an example of the corresponding hildarun.sh script to run them all.
swiftcont DRIVER
Restart a swift_* integrator after a power-down; back up all output files to a bak* directory.

Run the commands without arguments to see the full documentation and examples. There are also aliases mpilist, mpirun, mpistop, mpiresume, mpikill and mpirenice for several hilda* commands.
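As an illustration of the hildarun syntax listed above, the following sketch prints one hildarun command per node instead of submitting anything. The base directory, the per-node subdirectories run_1 ... run_11 and the job name ./myjob are hypothetical placeholders, not names used on the cluster; the real hildarun.sh example may look different.

```shell
#!/bin/sh
# Sketch: print one hildarun command per node, following the documented form
#   hildarun NODE DIRECTORY "JOB < STDIN >> STDOUT"
# Assumptions (hypothetical): one subdirectory run_N per node under BASEDIR,
# and a job called ./myjob reading myjob.in and appending to myjob.out.

hildarun_plan() {   # BASEDIR FIRST_NODE LAST_NODE
    base=$1
    i=$2
    last=$3
    while [ "$i" -le "$last" ]; do
        printf 'hildarun hilda%d %s/run_%d "%s"\n' \
            "$i" "$base" "$i" "./myjob < myjob.in >> myjob.out"
        i=$((i + 1))
    done
}

# Print the plan for all 11 nodes, using the current directory as BASEDIR.
hildarun_plan "$PWD" 1 11
```

Once the printed lines look right, they can be piped to sh to submit the jobs. Directory names containing spaces would need extra quoting.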

All scripts use ssh as a backend (on the basis of the ssh user@node command). Nevertheless, you can also start and stop your jobs manually on individual nodes, using standard Unix commands like nohup, kill -s STOP, etc. Please check the Ganglia monitor above to avoid overloading the cluster.
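Manual job control over ssh could look roughly like the sketch below. It uses only ssh, nohup and kill; the directory run_5, the job name ./myjob and the PID 12345 are hypothetical. By default the commands are only printed; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch of manual job control over ssh with standard tools only.
# The helper names below are illustrative; they are not the hilda* scripts.

DRY_RUN=${DRY_RUN:-1}   # 1 = print the ssh command, 0 = run it

remote() {   # NODE COMMAND...
    node=$1
    shift
    if [ "$DRY_RUN" = 1 ]; then
        printf 'ssh %s "%s"\n' "$node" "$*"
    else
        ssh "$node" "$*"
    fi
}

# Start a job in the background so that it survives logout; all standard
# streams are redirected so that ssh can return immediately.
start_job() {   # NODE DIRECTORY JOB
    remote "$1" "cd $2 && nohup $3 > job.out 2>&1 < /dev/null &"
}

# Suspend, resume or terminate a single process by PID.
stop_job() { remote "$1" "kill -s STOP $2"; }
cont_job() { remote "$1" "kill -s CONT $2"; }
kill_job() { remote "$1" "kill -s TERM $2"; }

start_job hilda5 run_5 ./myjob
stop_job  hilda5 12345
cont_job  hilda5 12345
```

A process suspended with STOP keeps its memory and resumes where it left off after CONT, which is why the pair is useful for temporarily freeing a node.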

To back up your data, you can copy it from ~/ to hilda2:/scratch2 or hilda3:/scratch3. All nodes are installed similarly, so in case of a head node failure it is easy to reconfigure the router and continue with a different head node. Jobs cannot be stopped on one node and restarted on another automatically.
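A backup copy could be made with scp, for example. The sketch below only prints the command; the directory name myrun is hypothetical, and it assumes you are allowed to write directly to the top level of the scratch disk.

```shell
#!/bin/sh
# Sketch: build an scp command that copies a directory recursively (-r),
# preserving modification times and modes (-p), to a scratch disk.
# DESTINATION is one of the targets named in the text, e.g. hilda2:/scratch2.

backup_cmd() {   # DIRECTORY DESTINATION
    printf 'scp -rp %s %s/\n' "$1" "$2"
}

# Print the command for a hypothetical directory under the current one.
backup_cmd "$PWD/myrun" hilda2:/scratch2
```

Alternating between hilda2:/scratch2 and hilda3:/scratch3 keeps two copies on different nodes.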

Acknowledgements: Grant Agency of the Czech Republic (grant 205/08/P196), Research Program MSM0021620860 of the Czech Ministry of Education.

Administration: Miroslav Broz, mira at sirrah.troja.mff.cuni.cz. See also HOWTO.hilda and INSTALL.sh. The following scripts are provided for administration purposes:

hildaaptitude ARGS
Run aptitude on all nodes.
hildaping
Ping all hilda cluster nodes.
hildascp FILE1 FILE2 ... DIR
Copy files to all nodes.
hildastop_dfmin
Stop all running jobs when the disk is almost full (less than 2 GB free); this is run every 15 minutes by cron.
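The idea behind a free-space check such as hildastop_dfmin can be sketched as follows. This is not the actual script: only the 2 GB threshold comes from the description above, while the checked path / and the function names are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of a free-space check; not the actual hildastop_dfmin script.

MIN_KB=$((2 * 1024 * 1024))   # 2 GB expressed in 1 KiB blocks

# Print the available space (in KiB) of the filesystem holding a path.
# df -P gives one line per filesystem; column 4 is the available space.
avail_kb() {   # PATH
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed (exit status 0) when the available space is below the threshold.
disk_almost_full() {   # AVAILABLE_KB
    [ "$1" -lt "$MIN_KB" ]
}

# Replace / with the filesystem that holds the job output.
if disk_almost_full "$(avail_kb /)"; then
    echo "less than 2 GB free: running jobs would be stopped here"
fi
```

A crontab entry that runs such a check every 15 minutes has the form `*/15 * * * * /path/to/check.sh`, where the path is a placeholder.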

Vesta cluster: there is a similar computing cluster at the Observatory and Planetarium Hradec Kralove, with a total power of approx. 35 GHz; see its monitor.

Version: Jun 10th 2010.
