Services and Computer Systems for the Internet. 2008-2009 academic year
Table of contents
Sometimes it is necessary to run several jobs that have dependencies among them. The classic example starts with a task that splits the input data, continues with multiple tasks that each process one part of the data, and ends with a task that combines the results of the processing tasks. This kind of dependency is expressed as a DAG (Directed Acyclic Graph). In Condor, jobs of this kind are run with DAGMan (DAG Manager).
In the example we are going to build, one task prepares the data (setup); its only job is to print a number. Next, two tasks (work1 and work2) process the data produced by the first task: in this case, they divide the number by two. Finally, another task (finalize) combines the results by adding them together.
Before starting the exercise, let's create a new directory to hold it.

$> mkdir ~/condor/dag
$> cd ~/condor/dag
Now let's create the programs.
$> cat > setup
#!/bin/sh
echo $RANDOM
$> cat > work
#!/bin/sh
read num
expr $num / 2
$> cat > finalize
#!/bin/sh
sum=0
for f in "$@"
do
  num=`cat $f`
  sum=`expr $num + $sum`
done
echo $sum
Give them execute permission.

$> chmod +x setup
$> chmod +x work
$> chmod +x finalize
Check that everything works.

$> ./setup | ./work > work1.out
$> ./setup | ./work > work2.out
$> ./finalize work1.out work2.out
18364
The setup program prints a random number on standard output. The work program reads a number from standard input, divides it by two and prints the result on standard output. The finalize program takes a list of files as parameters, reads their contents (each must contain a number), adds them up and prints the result.
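As a side note, the combined result need not equal the number setup printed: `expr $num / 2` performs integer division, so an odd input loses its remainder in each work task. A minimal Python sketch of the same data flow (not one of the lab files; the input value is just an illustration):

```python
# Sketch of the setup -> work -> finalize pipeline in Python, to show
# the rounding effect of integer division in the work step.
def work(num):
    return num // 2              # like: expr $num / 2 (truncates)

def finalize(parts):
    return sum(parts)            # like: summing the work*.out files

num = 23129                      # a value setup might print (odd)
parts = [work(num), work(num)]   # both work tasks read the same setup.out
print(parts)                     # [11564, 11564]
print(finalize(parts))           # 23128 -- one less than the original
```

Note that in the DAG version both work tasks read the same setup.out, so their outputs are identical; in the manual test above, setup was run twice and produced two different numbers.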
Next, let's create the job description files, one per task (setup, work1, work2 and finalize).
$> cat > setup.sub
Universe = vanilla
Executable = setup
output = setup.out
log = job.log
Error = setup.error
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Queue
$> cat > work1.sub
Universe = vanilla
Executable = work
input = setup.out
output = work1.out
error = work1.error
log = job.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = setup.out
Queue
$> cat > work2.sub
Universe = vanilla
Executable = work
input = setup.out
output = work2.out
error = work2.error
log = job.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = setup.out
Queue
$> cat > finalize.sub
Universe = vanilla
Executable = finalize
arguments = work1.out work2.out
output = finalize.out
error = finalize.error
log = job.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = work1.out,work2.out
Queue
We also have to create a file describing the dependencies among the tasks.
$> cat > job.dag
Job setup setup.sub
Job work1 work1.sub
Job work2 work2.sub
Job finalize finalize.sub
PARENT setup CHILD work1 work2
PARENT work1 work2 CHILD finalize
All that remains is to submit the job and wait for the results.
$> condor_submit_dag -f job.dag
Checking all your submit files for log file names.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor : job.dag.condor.sub
Log of DAGMan debugging messages : job.dag.dagman.out
Log of Condor library output : job.dag.lib.out
Log of Condor library error messages : job.dag.lib.err
Log of the life of condor_dagman itself : job.dag.dagman.log
Condor Log file for all jobs of this DAG : /gpfs/home_gridis/ruf/condor/dag/job.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 145.
-----------------------------------------------------------------------
Let's watch the execution of the job from the terminal.
$> watch -n 1 condor_q
To submit the job we used the condor_submit_dag command, which works like condor_submit except that it expects a file describing the dependencies among the tasks (a DAG description file). condor_submit_dag submits a job running the condor_dagman program, which takes care of executing the DAG.
The condor_dagman job runs in the Scheduler universe. In this universe, jobs run on the submit machine (they are executed by condor_schedd) and are never preempted.
When the DAG finishes we can look at the log.

$> cat job.log
000 (146.000.000) 01/31 11:25:22 Job submitted from host: <193.206.208.141:9632>
DAG Node: setup
...
001 (146.000.000) 01/31 11:30:13 Job executing on host: <193.206.208.205:9680>
...
005 (146.000.000) 01/31 11:30:13 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
6 - Run Bytes Sent By Job
23 - Run Bytes Received By Job
6 - Total Bytes Sent By Job
23 - Total Bytes Received By Job
...
000 (147.000.000) 01/31 11:30:22 Job submitted from host: <193.206.208.141:9632>
DAG Node: work1
...
000 (148.000.000) 01/31 11:30:22 Job submitted from host: <193.206.208.141:9632>
DAG Node: work2
...
001 (148.000.000) 01/31 11:30:33 Job executing on host: <193.206.208.214:9659>
...
005 (148.000.000) 01/31 11:30:33 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
6 - Run Bytes Sent By Job
38 - Run Bytes Received By Job
6 - Total Bytes Sent By Job
38 - Total Bytes Received By Job
...
001 (147.000.000) 01/31 11:30:33 Job executing on host: <193.206.208.205:9680>
...
005 (147.000.000) 01/31 11:30:33 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
6 - Run Bytes Sent By Job
38 - Run Bytes Received By Job
6 - Total Bytes Sent By Job
38 - Total Bytes Received By Job
...
000 (149.000.000) 01/31 11:30:42 Job submitted from host: <193.206.208.141:9632>
DAG Node: finalize
...
001 (149.000.000) 01/31 11:30:53 Job executing on host: <193.206.208.205:9680>
...
005 (149.000.000) 01/31 11:30:53 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
6 - Run Bytes Sent By Job
98 - Run Bytes Received By Job
6 - Total Bytes Sent By Job
98 - Total Bytes Received By Job
...
We can also look at the results of the tasks.

$> cat setup.out
23129
$> cat work1.out
11564
$> cat work2.out
11564
$> cat finalize.out
23128
Check the contents of the files job.dag.condor.sub, job.dag.dagman.log and job.dag.dagman.out. They contain information about the execution of the DAG.
$> cat job.dag.condor.sub
# Filename: job.dag.condor.sub
# Generated by condor_submit_dag job.dag
universe = scheduler
executable = /opt/condor/bin/condor_dagman
getenv = True
output = job.dag.lib.out
error = job.dag.lib.err
log = job.dag.dagman.log
remove_kill_sig = SIGUSR1
# Note: default on_exit_remove expression:
# ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
# attempts to ensure that DAGMan is automatically
# requeued by the schedd if it exits abnormally or
# is killed (e.g., during a reboot).
on_exit_remove = ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
copy_to_spool = False
arguments = -f -l . -Debug 3 -Lockfile job.dag.lock -Condorlog /gpfs/home_gridis/ruf/condor/dag/job.log -Dag job.dag -Rescue job.dag.rescue
environment = _CONDOR_DAGMAN_LOG=job.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
queue
$> cat job.dag.dagman.out
1/31 11:25:09 ******************************************************
1/31 11:25:09 ** condor_scheduniv_exec.145.0 (CONDOR_DAGMAN) STARTING UP
1/31 11:25:09 ** /opt/condor/bin/condor_dagman
1/31 11:25:09 ** $CondorVersion: 7.1.1 Jul 1 2008 PRE-RELEASE-UWCS $
1/31 11:25:09 ** $CondorPlatform: I386-LINUX_CENTOS45 $
1/31 11:25:09 ** PID = 31186
1/31 11:25:09 ** Log last touched time unavailable (No such file or directory)
1/31 11:25:09 ******************************************************
1/31 11:25:09 Using config source: /opt/condor/etc/condor_config
1/31 11:25:09 Using local config sources:
1/31 11:25:09    /var/condor/condor_config.local
1/31 11:25:09 DaemonCore: Command Socket at <193.206.208.141:9642>
1/31 11:25:09 DAGMAN_SUBMIT_DELAY setting: 0
1/31 11:25:09 DAGMAN_MAX_SUBMIT_ATTEMPTS setting: 6
1/31 11:25:09 DAGMAN_STARTUP_CYCLE_DETECT setting: 0
1/31 11:25:09 DAGMAN_MAX_SUBMITS_PER_INTERVAL setting: 5
1/31 11:25:09 allow_events (DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION, DAGMAN_ALLOW_EVENTS) setting: 114
1/31 11:25:09 DAGMAN_RETRY_SUBMIT_FIRST setting: 1
1/31 11:25:09 DAGMAN_RETRY_NODE_FIRST setting: 0
1/31 11:25:09 DAGMAN_MAX_JOBS_IDLE setting: 0
1/31 11:25:09 DAGMAN_MAX_JOBS_SUBMITTED setting: 0
1/31 11:25:09 DAGMAN_MUNGE_NODE_NAMES setting: 1
1/31 11:25:09 DAGMAN_DELETE_OLD_LOGS setting: 1
1/31 11:25:09 DAGMAN_PROHIBIT_MULTI_JOBS setting: 0
1/31 11:25:09 DAGMAN_SUBMIT_DEPTH_FIRST setting: 0
1/31 11:25:09 DAGMAN_ABORT_DUPLICATES setting: 1
1/31 11:25:09 DAGMAN_ABORT_ON_SCARY_SUBMIT setting: 1
1/31 11:25:09 DAGMAN_PENDING_REPORT_INTERVAL setting: 600
1/31 11:25:09 DAGMAN_AUTO_RESCUE setting: 0
1/31 11:25:09 DAGMAN_MAX_RESCUE_NUM setting: 100
1/31 11:25:09 argv[0] == "condor_scheduniv_exec.145.0"
1/31 11:25:09 argv[1] == "-Debug"
1/31 11:25:09 argv[2] == "3"
1/31 11:25:09 argv[3] == "-Lockfile"
1/31 11:25:09 argv[4] == "job.dag.lock"
1/31 11:25:09 argv[5] == "-Condorlog"
1/31 11:25:09 argv[6] == "/gpfs/home_gridis/ruf/condor/dag/job.log"
1/31 11:25:09 argv[7] == "-Dag"
1/31 11:25:09 argv[8] == "job.dag"
1/31 11:25:09 argv[9] == "-Rescue"
1/31 11:25:09 argv[10] == "job.dag.rescue"
1/31 11:25:09 DAG Lockfile will be written to job.dag.lock
1/31 11:25:09 DAG Input file is job.dag
1/31 11:25:09 Rescue DAG will be written to job.dag.rescue
1/31 11:25:09 All DAG node user log files:
1/31 11:25:09   /gpfs/home_gridis/ruf/condor/dag/job.log (Condor)
1/31 11:25:09 Parsing 1 dagfiles
1/31 11:25:09 Parsing job.dag ...
1/31 11:25:09 Dag contains 4 total jobs
1/31 11:25:09 Truncating any older versions of log files...
1/31 11:25:09 Sleeping for 12 seconds to ensure ProcessId uniqueness
1/31 11:25:21 Bootstrapping...
1/31 11:25:21 Number of pre-completed nodes: 0
1/31 11:25:21 Registering condor_event_timer...
1/31 11:25:22 Submitting Condor Node setup job(s)...
1/31 11:25:22 submitting: condor_submit -a dag_node_name' '=' 'setup -a ...
1/31 11:25:22 From submit: Submitting job(s).
1/31 11:25:22 From submit: Logging submit event(s).
1/31 11:25:22 From submit: 1 job(s) submitted to cluster 146.
1/31 11:25:22 assigned Condor ID (146.0)
1/31 11:25:22 Just submitted 1 job this cycle...
1/31 11:25:22 Event: ULOG_SUBMIT for Condor Node setup (146.0)
1/31 11:25:22 Number of idle job procs: 1
1/31 11:25:22 Of 4 nodes total:
1/31 11:25:22  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
1/31 11:25:22   ===     ===      ===     ===     ===        ===      ===
1/31 11:25:22     0       0        1       0       0          3        0
1/31 11:30:17 Event: ULOG_EXECUTE for Condor Node setup (146.0)
1/31 11:30:17 Number of idle job procs: 0
1/31 11:30:17 Event: ULOG_JOB_TERMINATED for Condor Node setup (146.0)
1/31 11:30:17 Node setup job proc (146.0) completed successfully.
1/31 11:30:17 Node setup job completed
1/31 11:30:17 Number of idle job procs: 0
1/31 11:30:17 Of 4 nodes total:
1/31 11:30:17  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
1/31 11:30:17   ===     ===      ===     ===     ===        ===      ===
1/31 11:30:17     1       0        0       0       2          1        0
1/31 11:30:22 Submitting Condor Node work1 job(s)...
1/31 11:30:22 submitting: condor_submit -a dag_node_name' '=' 'work1 -a ...
1/31 11:30:22 From submit: Submitting job(s).
1/31 11:30:22 From submit: Logging submit event(s).
1/31 11:30:22 From submit: 1 job(s) submitted to cluster 147.
1/31 11:30:22 assigned Condor ID (147.0)
1/31 11:30:22 Submitting Condor Node work2 job(s)...
1/31 11:30:22 submitting: condor_submit -a dag_node_name' '=' 'work2 -a ...
1/31 11:30:22 From submit: Submitting job(s).
1/31 11:30:22 From submit: Logging submit event(s).
1/31 11:30:22 From submit: 1 job(s) submitted to cluster 148.
1/31 11:30:22 assigned Condor ID (148.0)
1/31 11:30:22 Just submitted 2 jobs this cycle...
1/31 11:30:22 Event: ULOG_SUBMIT for Condor Node work1 (147.0)
1/31 11:30:22 Number of idle job procs: 1
1/31 11:30:22 Event: ULOG_SUBMIT for Condor Node work2 (148.0)
1/31 11:30:22 Number of idle job procs: 2
1/31 11:30:22 Of 4 nodes total:
1/31 11:30:22  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
1/31 11:30:22   ===     ===      ===     ===     ===        ===      ===
1/31 11:30:22     1       0        2       0       0          1        0
1/31 11:30:37 Event: ULOG_EXECUTE for Condor Node work2 (148.0)
1/31 11:30:37 Number of idle job procs: 1
1/31 11:30:37 Event: ULOG_JOB_TERMINATED for Condor Node work2 (148.0)
1/31 11:30:37 Node work2 job proc (148.0) completed successfully.
1/31 11:30:37 Node work2 job completed
1/31 11:30:37 Number of idle job procs: 1
1/31 11:30:37 Event: ULOG_EXECUTE for Condor Node work1 (147.0)
1/31 11:30:37 Number of idle job procs: 0
1/31 11:30:37 Event: ULOG_JOB_TERMINATED for Condor Node work1 (147.0)
1/31 11:30:37 Node work1 job proc (147.0) completed successfully.
1/31 11:30:37 Node work1 job completed
1/31 11:30:37 Number of idle job procs: 0
1/31 11:30:37 Of 4 nodes total:
1/31 11:30:37  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
1/31 11:30:37   ===     ===      ===     ===     ===        ===      ===
1/31 11:30:37     3       0        0       0       1          0        0
1/31 11:30:42 Submitting Condor Node finalize job(s)...
1/31 11:30:42 submitting: condor_submit -a dag_node_name' '=' 'finalize -a ...
1/31 11:30:42 From submit: Submitting job(s).
1/31 11:30:42 From submit: Logging submit event(s).
1/31 11:30:42 From submit: 1 job(s) submitted to cluster 149.
1/31 11:30:42 assigned Condor ID (149.0)
1/31 11:30:42 Just submitted 1 job this cycle...
1/31 11:30:42 Event: ULOG_SUBMIT for Condor Node finalize (149.0)
1/31 11:30:42 Number of idle job procs: 1
1/31 11:30:42 Of 4 nodes total:
1/31 11:30:42  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
1/31 11:30:42   ===     ===      ===     ===     ===        ===      ===
1/31 11:30:42     3       0        1       0       0          0        0
1/31 11:30:57 Event: ULOG_EXECUTE for Condor Node finalize (149.0)
1/31 11:30:57 Number of idle job procs: 0
1/31 11:30:57 Event: ULOG_JOB_TERMINATED for Condor Node finalize (149.0)
1/31 11:30:57 Node finalize job proc (149.0) completed successfully.
1/31 11:30:57 Node finalize job completed
1/31 11:30:57 Number of idle job procs: 0
1/31 11:30:57 Of 4 nodes total:
1/31 11:30:57  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
1/31 11:30:57   ===     ===      ===     ===     ===        ===      ===
1/31 11:30:57     4       0        0       0       0          0        0
1/31 11:30:57 All jobs Completed!
1/31 11:30:57 Note: 0 total job deferrals because of -MaxJobs limit (0)
1/31 11:30:57 Note: 0 total job deferrals because of -MaxIdle limit (0)
1/31 11:30:57 Note: 0 total job deferrals because of node category throttles
1/31 11:30:57 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
1/31 11:30:57 Note: 0 total POST script deferrals because of -MaxPost limit (0)
1/31 11:30:57 **** condor_scheduniv_exec.145.0 (condor_DAGMAN) EXITING WITH STATUS 0
To put what you have learned about Condor into practice, the proposed exercise is to run a job that computes the histogram of the words appearing in a book.
Before starting the new exercise, let's create a new directory to hold it.

$> mkdir ~/condor/histo
$> cd ~/condor/histo
With that organizational formality out of the way, let's download some books.

$> wget http://www.atc.uniovi.es/doctorado/6grid/data/quijote.txt
--12:07:44--  http://www.atc.uniovi.es/doctorado/6grid/data/quijote.txt
           => `quijote.txt'
Resolving www.atc.uniovi.es... 156.35.151.4
Connecting to www.atc.uniovi.es|156.35.151.4|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2,130,803 (2.0M) [text/plain]

100%[=================================================================================>] 2,130,803 788.18K/s

12:07:48 (786.01 KB/s) - `quijote.txt' saved [2130803/2130803]

$> wget http://www.atc.uniovi.es/doctorado/6grid/data/biblia.txt
--12:51:14--  http://www.atc.uniovi.es/doctorado/6grid/data/biblia.txt
           => `biblia.txt'
Resolving www.atc.uniovi.es... 156.35.151.4
Connecting to www.atc.uniovi.es|156.35.151.4|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4,076,632 (3.9M) [text/plain]

100%[=================================================================================>] 4,076,632 907.05K/s  ETA 00:00

12:51:20 (795.48 KB/s) - `biblia.txt' saved [4076632/4076632]
The program that computes the histogram is written in Python.

$> cat > histo.py
#!/usr/bin/env python
import string, sys
from string import punctuation

if len(sys.argv) != 2:
    sys.exit("Usage: histo <file>")

file = open(sys.argv[1], "r")
filedata = file.read()
file.close()

filewords = string.split(filedata)
words = []
for word in filewords:
    words.append(word.strip(punctuation).lower())

histogram = {}
for word in words:
    histogram[word] = histogram.get(word, 0) + 1

flist = []
for word, count in histogram.items():
    flist.append([count, word])
flist.sort()
flist.reverse()
for pair in flist:
    print pair[1], pair[0]
Give it execute permission and check that it works correctly.

$> chmod +x histo.py
$> ./histo.py quijote.txt
que 20610
de 18196
y 18153
la 10360
a 9799
en 8205
el 8203
no 6224
los 4744
se 4690
con 4184
por 3894
las 3465
lo 3459
le 3398
su 3352
don 2647
del 2490
me 2344
como 2261
quijote 2175
sancho 2148
es 2104
...
We will also create a program that lets us combine the histograms obtained from several books.

$> cat > comhisto.py
#!/usr/bin/env python
import string, sys

histogram = {}
for filenum in range(1, len(sys.argv)):
    file = open(sys.argv[filenum], "r")
    for line in file.readlines():
        fields = string.splitfields(line, ' ')
        histogram[fields[0]] = histogram.get(fields[0], 0) + int(fields[1])
    file.close()

flist = []
for word, count in histogram.items():
    flist.append([count, word])
flist.sort()
flist.reverse()
for pair in flist:
    print pair[1], pair[0]
$> chmod +x comhisto.py
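The combining step boils down to summing the counts per word across the input histograms. A standalone sketch of that idea (Python 3 syntax, unlike the Python 2 lab scripts above; the histogram fragments are made up for illustration):

```python
# Merge "word count" histogram lines by summing the counts per word,
# as comhisto.py does with its input files.
def combine(histos):
    merged = {}
    for histo in histos:
        for line in histo.splitlines():
            word, count = line.split()
            merged[word] = merged.get(word, 0) + int(count)
    return merged

h1 = "dios 524\nque 20610"   # hypothetical fragment of one histogram
h2 = "dios 4280"             # hypothetical fragment of another
print(combine([h1, h2]))     # {'dios': 4804, 'que': 20610}
```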
Create the description files needed to run three jobs: two that compute the histograms of the books quijote.txt and biblia.txt, and a third that combines them. The last job depends on the previous two, so you will need to use DAGMan. Configure the jobs' output so that it is written to the files quijote.histo.out, biblia.histo.out and comhisto.out (the latter for the combined histogram).
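For reference, the dependency file for this exercise could take the same fan-in shape as job.dag from the previous section; the node and submit-file names below are only a suggestion, not part of the assignment:

```
Job quijote quijote.sub
Job biblia biblia.sub
Job comhisto comhisto.sub
PARENT quijote biblia CHILD comhisto
```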
To check that the jobs ran correctly you can do the following test.
$> grep '^dios ' quijote.histo.out
dios 524
$> grep '^dios ' biblia.histo.out
dios 4280
$> grep '^dios ' comhisto.out
dios 4804