[Networker] Splitting up large directory save jobs into smaller pieces (aka. "parallelizing")
2010-03-03 09:14:33
Hi List,
A customer asked me whether it is possible to speed up saves of large
directories with many sub-directories (or files). Unfortunately NetWorker
itself cannot automatically parallelize saving a directory, which is quite odd.
So I wrote a simple bash script that splits up large
directories into smaller (and parallel) save jobs.
This bash script runs as a NetWorker command: it decodes the arguments
(group name, server name and so on) submitted by the NetWorker server
and invokes multiple save commands, one for each sub-directory under the
submitted directory (the "save set" attribute in the client configuration).
When started, the script reads the command line arguments and currently
treats the last argument as the directory to be saved. It then
invokes one save process for every directory found directly beneath the
directory given in the save set attribute.
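A minimal sketch of this dispatch step (without the concurrency limit described below). The `SAVE_CMD` variable is my stand-in for NetWorker's real `save` binary so the sketch can be dry-run; treating the last argument as the save set path follows the description above, and any leading arguments are simply passed over:

```shell
# parallel_save: start one backup process per immediate sub-directory
# of the save set path (the last command line argument).
# SAVE_CMD stands in for NetWorker's "save" binary; override it (e.g.
# SAVE_CMD="echo save") to dry-run the sketch without a NetWorker client.
parallel_save() {
    for arg in "$@"; do saveset=$arg; done   # last argument = save set path

    for dir in "$saveset"/*/; do
        [ -d "$dir" ] || continue            # skip if no sub-directories exist
        ${SAVE_CMD:-save} "$dir" &           # one save process per sub-directory
    done
    wait                                     # block until all save jobs end
}
```

Note that a save set directory without sub-directories simply results in no jobs here, which is exactly problem (2) below.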
Within the script there is a parameter which specifies the maximum number of
save jobs allowed to run in parallel. To enforce this limit I use a
small queue:
Every invoked save command is added to the queue together with its
process ID. Once per second the script checks whether a process has finished
by comparing the process IDs in the queue with /proc; if a process ID is
no longer listed in /proc, the process is assumed to be finished and removed
from the queue. The freed slot is then taken by the next save process
waiting to run.
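The queue just described can be sketched like this (assumptions: a Linux-style /proc filesystem is available, and `MAX_JOBS` is the illustrative name for the parallelism parameter):

```shell
MAX_JOBS=4
pids=""                       # queue: PIDs of the currently running save jobs

# Drop every PID that no longer shows up in /proc ("finished").
prune_finished() {
    live=""
    for pid in $pids; do
        [ -d "/proc/$pid" ] && live="$live $pid"
    done
    pids=$live
}

# Helper: count the words (PIDs) passed as arguments.
count_pids() { echo $#; }

# Start "$@" in the background as soon as a queue slot is free.
run_throttled() {
    prune_finished
    while [ "$(count_pids $pids)" -ge "$MAX_JOBS" ]; do
        sleep 1               # poll once per second, as described above
        prune_finished
    done
    "$@" &                    # launch the next waiting save job
    pids="$pids $!"           # remember its process ID in the queue
}
```

Checking `/proc/$pid` is cheap but Linux-specific; `kill -0 "$pid"` would be the portable alternative.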
While writing a short prototype I noticed some problems with it:
1. The script does not handle multiple directories specified in the "save
set" attribute of the client configuration. This can be fixed; no problem.
2. If the directory specified in the save set attribute contains no
sub-directories, the save runs without any parallelization.
3. If the save set directory contains, let's say, one large sub-directory
and one small one, the script splits the work into two save jobs: one for
the large directory and one for the small one. The time saved won't be
substantial because the large directory still isn't saved in parallel.
4. Doing incremental backups might be tricky.
Based on these observations I began to rethink my first approach.
Another way to parallelize save jobs would be to do it at the file level
rather than the directory level, which might work as follows:
1. Create a list of all files under the specified save set directory
2. Split this list into <number of concurrent jobs> files with an
equal number of file names in each
3. Save every list file created in step (2): "save <some arguments> -I
<split file name>"
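The three steps above could be sketched as follows. I am assuming the "save ... -I <file>" syntax from step (3); `SAVE_CMD` again stands in for the real `save` binary so the sketch can be dry-run:

```shell
# parallel_save_by_files <saveset dir> <number of concurrent jobs>
# Implements steps (1)-(3): build a file list, split it into roughly
# equal chunks, then run one "save -I <chunk>" per chunk in parallel.
parallel_save_by_files() {
    saveset=$1
    jobs=$2
    listdir=$(mktemp -d)

    # (1) create the complete file list
    find "$saveset" -type f > "$listdir/all_files"
    total=$(wc -l < "$listdir/all_files")
    [ "$total" -gt 0 ] || return 0

    # (2) split into at most $jobs chunks of (roughly) equal length
    per_chunk=$(( (total + jobs - 1) / jobs ))
    split -l "$per_chunk" "$listdir/all_files" "$listdir/chunk."

    # (3) one save process per chunk, fed via -I
    for chunk in "$listdir"/chunk.*; do
        ${SAVE_CMD:-save} -I "$chunk" &
    done
    wait
    rm -rf "$listdir"
}
```

This sidesteps problems (2) and (3) of the directory-based approach, since the chunks are balanced by file count regardless of how the directory tree is shaped, but it inherits the two problems listed next.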
This approach has some problems as well:
1. Building the file list can take quite some time when there are many files
2. The list file can get quite big (several hundred MB)
I am aware this problem could be solved by using NDMP and some kind of
block-based backup approach rather than a file-based one. But I'd like to
have a script at hand which automatically splits large save jobs into
smaller ones (and I guess I am not the only one).
Feedbacks? Comments? Any other ideas?
Yours sincerely
Ronny Egner
--
Ronny Egner
RonnyEgner AT gmx DOT de
To sign off this list, send email to listserv AT listserv.temple DOT edu and
type "signoff networker" in the body of the email. Please write to
networker-request AT listserv.temple DOT edu if you have any problems with this
list. You can access the archives at
http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER