Subject: [Networker] Splitting up large directory save jobs into smaller pieces (aka. "parallelizing")
From: Ronny Egner <RonnyEgner AT GMX DOT DE>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Wed, 3 Mar 2010 15:12:15 +0100
Hi List,

One customer asked me if it is possible to speed up saves of large
directories with many sub-directories (or files). Unfortunately Networker
itself cannot automatically parallelize the saving of a directory. That's
quite odd.

So I wrote a simple bash script to split up large directories into
smaller (and parallel) save jobs.


This bash script runs as the Networker backup command: it decodes the
arguments (group name, server name and so on) submitted by the Networker
server and invokes multiple save commands, one for each sub-directory
under the submitted directory (attribute "save set" in the client
configuration).


When started, the script reads the command-line arguments and currently
treats the last argument as the directory to be saved. Afterwards it
invokes one save process for every directory found directly under the
directory specified in the "save set" attribute.
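
The core of it is roughly this (a simplified sketch, not the full script;
the real argument list passed by the Networker server is longer, here I
just treat the last argument as the directory and pass everything before
it through to save unchanged; SAVE_ARGS and TOPDIR are only illustrative
names):

#!/bin/bash
# Simplified sketch: the last argument is the directory from the
# "save set" attribute, everything before it goes to "save" unchanged.

SAVE_ARGS=("${@:1:$#-1}")   # all arguments except the last one
TOPDIR="${@: -1}"           # last argument = directory to back up

# one save process per immediate sub-directory
for dir in "$TOPDIR"/*/; do
    save "${SAVE_ARGS[@]}" "$dir" &
done
wait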

Within the script there is a variable which specifies the maximum number
of parallel save jobs allowed to run. To enforce this I use a small queue:
every invoked save command is added to the queue together with its process
id. Every second the script checks whether a process has finished (by
comparing the process ids in the queue with /proc); if a process id is no
longer listed in /proc the process is assumed to be finished and is
removed from the queue. The freed slot is then taken by the next waiting
save process.
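
The throttling part looks roughly like this (again a simplified sketch;
MAXJOBS and run_throttled are just illustrative names, TOPDIR and
SAVE_ARGS are the same as in the sketch above):

#!/bin/bash
# Keep at most MAXJOBS save processes running at the same time.
# A slot is considered free once the pid no longer shows up in /proc.

MAXJOBS=4
PIDS=()

run_throttled() {
    # wait for a free slot, polling once per second
    while [ "${#PIDS[@]}" -ge "$MAXJOBS" ]; do
        sleep 1
        STILL_RUNNING=()
        for pid in "${PIDS[@]}"; do
            # no /proc entry any more -> this save has finished
            [ -d "/proc/$pid" ] && STILL_RUNNING+=("$pid")
        done
        PIDS=("${STILL_RUNNING[@]}")
    done
    "$@" &            # start the save in the background ...
    PIDS+=("$!")      # ... and remember its pid in the queue
}

# TOPDIR and SAVE_ARGS as in the sketch above
for dir in "$TOPDIR"/*/; do
    run_throttled save "${SAVE_ARGS[@]}" "$dir"
done
wait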


While writing a short prototype I noticed some problems with it:

1. The script does not handle multiple directories specified in the "save
set" attribute of the client configuration. This can be fixed, no problem.

2. If the directory specified in the save set attribute contains no
sub-directories, the save will run non-parallelized.

3. If the directory specified in the save set attribute contains, let's
say, one large sub-directory and one small one, the script will split
the work into two save jobs: one for the large directory and one for the
small one. The estimated time saving won't be that big because the larger
directory still isn't saved in parallel.

4. Doing incremental backups might be tricky.



Based on these observations I began to rethink my first approach.


Another approach to parallelizing save jobs would be to do it at the file
level rather than the directory level, which might work as follows:

1. Create a file list of all files under the specified save set directory.
2. Split up this file list into <number of concurrent jobs> files with an
equal number of files in each.
3. For every file created in step (2), run one save per list: "save <some
arguments> -I <split file name>" (a rough sketch follows below the list).
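
A quick sketch of how that could look (again simplified, no error
handling; JOBS, TOPDIR and the scratch directory are only illustrative,
and SAVE_ARGS stands for whatever server/group/level options the real
script would pass on):

#!/bin/bash
# File-level variant: build one big file list, split it into equal
# pieces and feed each piece to its own "save -I" process.

JOBS=4
TOPDIR=/data/bigdir               # directory from the "save set" attribute
TMP=/nsr/tmp/filelists.$$         # scratch location, adjust as needed
SAVE_ARGS=(-s backupserver)       # placeholder for the real save options
mkdir -p "$TMP"

# 1. complete file list
find "$TOPDIR" -type f > "$TMP/all_files"

# 2. split into JOBS pieces with (roughly) the same number of lines
LINES=$(( $(wc -l < "$TMP/all_files") / JOBS + 1 ))
split -l "$LINES" "$TMP/all_files" "$TMP/chunk."

# 3. one save per piece, reading the path list via -I
for chunk in "$TMP"/chunk.*; do
    save "${SAVE_ARGS[@]}" -I "$chunk" &
done
wait

rm -rf "$TMP"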


This approach has some problems as well:

1. Creating the file list can take quite some time when there are a lot
of files.
2. The file list might get quite big (several hundred MB).


I am aware this problem can be solved by using NDMP and some kind of
block-based backup approach rather than a file-based one. But I'd like to
have a script at hand which automatically splits up large save jobs into
smaller ones (I guess I am not the only one).


Feedback? Comments? Any other ideas?



Yours sincerely
Ronny Egner
-- 
Ronny Egner
RonnyEgner AT gmx DOT de
