Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

linux - Shell Script for multithreading a process

I am a bioinformatician and recently got stuck on a problem that requires some scripting to speed up my process. We have a program called PHASE, and the command I type on the command line to run it is:

./PHASE test.inp test.out

where PHASE is the name of the program, test.inp is the input file, and test.out is the output file. The process uses one core and takes approximately 3 hours to complete.

Now I have 1000 input files, named test1.inp, test2.inp, test3.inp, and so on up to test1000.inp, and I want to generate all 1000 output files (test1.out, test2.out, ..., test1000.out) using the full capacity of my system, which has 4 cores.

To use the full capacity of my system, I want to start 4 instances of the above command, each taking a different input file and generating a different output file, like this:

./PHASE test1.inp test1.out
./PHASE test2.inp test2.out
./PHASE test3.inp test3.out
./PHASE test4.inp test4.out

After each job has finished and its output file has been generated, the script should start jobs for the remaining input files until all of them are done:

./PHASE test5.inp test5.out
./PHASE test6.inp test6.out
./PHASE test7.inp test7.out
./PHASE test8.inp test8.out 

and so on.

How can I write a script for this that takes advantage of all 4 cores and speeds up my process?

1 Answer

If you have GNU xargs, consider something like:

printf '%s\0' *.inp | xargs -0 -P 4 -n 1 \
  sh -c 'for f; do ./PHASE "$f" "${f%.inp}.out"; done' _

The -P 4 is important here, indicating the number of processes to run in parallel.

If you have a very large number of inputs and they're fast to process, consider replacing -n 1 with a larger number, to increase the number of inputs each shell instance iterates over -- decreasing shell startup costs, but also reducing granularity and, potentially, level of parallelism.
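
To see how -n changes the grouping without actually running PHASE, a dry run can substitute echo for the real command. This is only a sketch: batch_demo is a hypothetical helper name, and the file names are made up for illustration, so no files need to exist.

```shell
# Dry run: print what each shell instance started by xargs would receive.
# batch_demo is a hypothetical helper; a.inp ... e.inp are made-up names.
batch_demo() {
  # $1 = value passed to xargs -n; remaining arguments = input names
  n=$1
  shift
  printf '%s\0' "$@" |
    xargs -0 -n "$n" sh -c 'echo "$# file(s): $*"' _
}

batch_demo 2 a.inp b.inp c.inp d.inp e.inp
```

With -n 2 and five names, this prints three lines: two instances receive two files each, and the last receives the one left over. Raising -n means fewer shell startups but coarser units of work.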


That said, if you really want to do batches of four (per your question), letting all four finish before starting the next four (which introduces some inefficiency, but is what you asked for), you could do something like this...

set -- *.inp                # set $@ to list of files matching *.inp
while (( $# )); do          # until we exhaust that list...
  for ((i=0; i<4; i++)); do # loop over batches of four...
    # as long as there's a next argument, start a process for it, and take it off the list
    [[ $1 ]] && ./PHASE "$1" "${1%.inp}.out" & shift
  done
  wait                      # ...and wait for running processes to finish before proceeding
done
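
The inefficiency of the batch loop comes from every batch waiting on its slowest job. If xargs is not an option, a pure-bash alternative (a different technique from the loop above, and not part of the original answer) is to keep a rolling pool of jobs with wait -n. The following is a minimal sketch, assuming bash 4.3 or newer; run_pool and its argument order are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: run_pool is a hypothetical helper, not part of PHASE or bash.
# Usage: run_pool MAX_JOBS COMMAND FILE...
run_pool() {
  local max_jobs=$1 cmd=$2 f
  shift 2
  for f in "$@"; do
    # While the pool is full, block until any one background job exits.
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
      wait -n   # needs bash 4.3 or newer
    done
    # Start the next job: input name, then the same name with .inp -> .out
    "$cmd" "$f" "${f%.inp}.out" &
  done
  wait          # let the last jobs finish
}

# Example call (commented out so the sketch can be sourced safely):
# run_pool 4 ./PHASE *.inp
```

Called as run_pool 4 ./PHASE *.inp, this would start a new PHASE run as soon as any of the four running ones exits, which is the same scheduling behaviour xargs -P 4 gives.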
