Subject: [Networker] Staging with a Float
From: Ian G Batten <ian.batten AT UK.FUJITSU DOT COM>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Thu, 7 Feb 2008 14:33:12 +0000
One thing we're keen to do is to hold a few days of on-line backups (for rapid restoration) but to take them to tape periodically in case the disk subsystem dies or we need the savesets months later. This need has grown now that we have a decent-sized pool of disk twenty miles away on a GigE link: we want to back up to the remote site, but keep local tapes for long-term archive.

We are also in the unusual position that we don't have a tape device on the networker server, so the bootstraps are going to a dedicated adv_file device on the networker server and then being taken to tape by cloning or staging (as the mood takes us). Our ideal solution is to have the bootstraps on on-site disk, off-site disk and tape within an hour or so.
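
(As an aside: a quick sanity check on where the bootstraps currently live is mminfo's bootstrap report, something along the lines of

  mminfo -s backup-srv.ftel.co.uk -B

which, if I'm remembering the flag right, lists the recent bootstrap savesets along with the volumes holding them.)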

Cloning would sort-of work, except we would still have to manage the size of the disk collection by expiring one of the two clones of the saveset. And as the tapes are only really an absolute last line of defence (we have the live copy on-site and the disk copy off-site) we would rather run the cloning to tape during the day, when we can fix anything that goes wrong with the robots, rather than at night, when the link is saturated with replication jobs and the robot is left to its own devices.

So far as we can see, automatic cloning always takes place as the job finishes. You can't say ``when finished, queue a clone job, and run all the clones in this pool on this trigger''.

Staging, unfortunately, will always delete savesets that have been staged. In an ideal world we could have a staging policy in which the trigger on saveset age was independent of the trigger on capacity, so we could say ``stage savesets within X hours of writing, and delete savesets that have been staged if the disk is more than X% full''. At the moment if you set, say, maximum retention time to one hour in a staging policy, it will stage the savesets within an hour but will delete them from the source even if the volume is 99% empty.
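
To make the distinction concrete, the capacity half of the policy I'd like is only a few lines of shell. This is just a sketch: the mount point, pool and 90% threshold are made up for illustration, and it assumes df keeps its output on one data line.

#!/bin/ksh
# sketch only: remove staged copies only when the filesystem underneath
# the adv_file staging device is actually filling up
typeset -r server=backup-srv.ftel.co.uk
typeset -r mountpoint=/nsr/staging      # made-up path for illustration
typeset -ri threshold=90                # only delete above 90% used

# current %used of the filesystem, e.g. 42 from "42%"
used=$(df -k $mountpoint | awk 'NR==2 { sub(/%/, "", $5); print $5 }')

if (( used > threshold )); then
  # only savesets that already have their other copies (staging volume,
  # .RO shadow and tape) are candidates for deletion
  mminfo -s $server -q "pool=BootstrapStaging,copies>=3" \
         -r ssid,cloneid 2>/dev/null | grep -v ssid |
  while read ssid cloneid; do
    nsrmm -s $server -d -y -S $ssid/$cloneid
  done
fi

The age half (``stage within X hours of writing'') would then be driven separately, which is exactly what the staging policy won't let me express.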

Does anyone have any clean workarounds for this? Or reasons why what I want to do is stupid?

I've written the following shell script, which runs out of cron every few hours (and desperately needs locking against multiple invocations: I'll code that this evening) but I can't say the need to do this fills me with joy, and it's full of all sorts of hacky workarounds. nsrmm -d -y in a shell script is bound to end in tears one day.


ian

#!/bin/ksh

typeset -r tmp=/tmp/$(basename $0).$$
trap "rm -f $tmp" 0
typeset -r server=backup-srv.ftel.co.uk
typeset -r window=14days
typeset -r retention=7days


# bootstraps from offsite disk to onsite disk and onsite tape
# everything else from offsite disk to onsite tape

for pool in BootstrapStaging DatabasesStaging IncrementalsStaging; do
  case $pool in
     DatabasesStaging) set -A target DatabasesClone ;;
     BootstrapStaging) set -A target BootstrapOnlineClones IndexClones ;;
     IncrementalsStaging) set -A target Incrementals ;;
     *) echo $0: $pool is an unknown pool 1>&2 ; exit 1 ;;
  esac

# scan volumes that are in use recently
  for volume in $(mminfo -s $server \
                  -q "pool=${pool},volaccess>-${window}" \
                  -r volume 2>/dev/null); do
     # only scan the read-only .RO shadow volumes (selecting on readonly in mminfo doesn't work)
     case $volume in
       *.RO) echo $0: scanning $volume 1>&2 ;;
       *) echo $0: skipping $volume 1>&2; continue ;;
     esac

     # find the savesets for which we have no clones
     # if we have previously tried to clone to two pools, but only one
     # succeeded, this WILL NOT re-clone to the missing pool.
     # 3 means one on the staging set, one on the .RO shadow, one on tape
     mminfo -s $server  -q "volume=$volume,copies<3,!incomplete" \
            -r ssid 2> /dev/null |
       sort -u > $tmp

     # if we found anything then clone to the selected pool(s)
     if [[ -s $tmp ]]; then
       for t in ${target[*]}; do
         echo $0: saving $(wc -l < $tmp) savesets from $volume to $t... 1>&2
         if [[ $1 = live ]]; then
           nsrclone -b $t -s $server -S -f $tmp
         fi
       done
     else
       echo $0: no cloning work to do for $volume 1>&2
     fi

     # find the staging copies of savesets for which we now have
     # other copies
     mminfo -s $server -q "volume=${volume},copies>=3,savetime<-${retention}" \
         -r ssid,cloneid 2> /dev/null | grep -v ssid > $tmp
     # and delete them if required
     if [[ -s $tmp ]]; then
       while read ssid cloneid; do
          echo $0: can delete $ssid/$cloneid 1>&2
          if [[ $1 = live ]]; then
             nsrmm -s $server -d -y -S $ssid/$cloneid
          fi
       done < $tmp
       # tidy up the volumes if we deleted anything
       echo $0: cleaning up $volume 1>&2
       if [[ $1 = live ]]; then
         nsrstage -v -s $server -C -V $volume
       fi
     else
       echo $0: nothing to delete from $volume 1>&2
     fi
   done
done
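
For completeness, the locking I mentioned above will probably be nothing fancier than a mkdir mutex wrapped around the whole thing; roughly this, untested:

#!/bin/ksh
# mkdir is atomic, so a second cron invocation simply gives up rather
# than racing the first one through nsrclone/nsrmm
typeset -r lockdir=/tmp/$(basename $0).lock

if ! mkdir $lockdir 2>/dev/null; then
  echo $0: another instance appears to be running, exiting 1>&2
  exit 0
fi
trap "rmdir $lockdir" 0

# ... the cloning and deletion loops above go here ...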

