Subject: Re: [Networker] Staging: Looks like the wrong save sets have been staged
From: Tony Albers <Tony.Albers AT PROACT DOT DK>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Tue, 7 May 2013 07:51:29 +0000
Hi Tam,

I can't say exactly what's going on at your place, but here are some of my 
experiences with staging that might help you out a bit.

I don't think staging sorts the list of eligible savesets by age or anything; it 
just builds a list of whatever falls into the category you've defined and then 
starts. This can result in the newest saveset being staged first, or in any 
order, as long as they are eligible.

If your staging was interrupted by the tape drive failing (or for some other 
reason), you could end up with savesets on both a tape volume and a disk volume. 
The next time staging starts, it tries to stage a saveset from disk to a tape 
volume which already has that saveset, and then fails with a message like 
"volume not eligible" in daemon.raw or daemon.log.

The way staging works is that it is actually a clone operation followed by 
deletion of the cloned savesets from the original volume. So if staging is 
interrupted for some reason, it never gets to the deletion part, and you're left 
with identical savesets on two different volumes (disk and tape). The next time 
staging starts, it might try to use the same volume again, fail, and very often 
hang.
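
If you want to check whether you're in that situation, an mminfo query along 
these lines will show savesets on disk that already have a second instance 
somewhere else (the volume name is made up, and attribute names can vary a bit 
between versions, so treat it as a sketch):

   # Savesets on the disk volume that already have more than one copy,
   # i.e. a clone instance probably already exists on a tape volume
   mminfo -avot -q "volume=diskvol.001,copies>1" -r "ssid,cloneid,savetime,sumsize,name"

Once you've verified the tape copy is good, the leftover disk instance can be 
removed by hand (nsrmm -d -S <ssid>/<cloneid>).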

Furthermore, if you have backups going directly to tape and staging decides to 
start while the tape drive is busy with a backup job, your staging will also 
hang. This is very likely to happen if you only have one tape drive.

I've actually stopped using staging because of the issues mentioned above. 
Especially at smaller installations with only a single tape drive, it can easily 
foul up and hang pretty much everything. Instead of staging, I'm using scheduled 
cloning and setting the clone retention times on the tape volumes to however 
long I want to keep the data. On the disk device, I've just set browse and 
retention to a couple of weeks (the client browse and retention settings). This 
way I get to decide when the data is moved to tape, which is usually a couple of 
times every week. And since the retention times of the saveset clones on the 
disk device are so short, they are removed after a couple of weeks.
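
As a rough illustration of that approach (the pool names are made up, and the 
exact flags may differ slightly on your NetWorker version):

   # Clone everything in the disk pool that doesn't have a tape copy yet
   # ("DiskBackup" and "TapeClone" are example pool names)
   mminfo -avq "pool=DiskBackup,copies=1" -r "ssid" | xargs nsrclone -b "TapeClone" -S

Run something like that from cron a couple of times a week; expiry of the disk 
copies is then handled by the normal client browse/retention settings.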

HTH

/tony



-----Original Message-----
From: EMC NetWorker discussion [mailto:NETWORKER AT LISTSERV.TEMPLE DOT EDU] On 
Behalf Of tammclaughlin
Sent: Monday, May 06, 2013 5:33 PM
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Subject: [Networker] Staging: Looks like the wrong save sets have been staged

I have an issue with a staging policy where the most recent save sets have been 
staged rather than the oldest.
Let me explain this:


Some background:

NetWorker 7.6.09
Backups go to adv_file devices and are then cloned to tape; staging moves save 
sets from adv_file to tape.

Staging policy: start: 95%, stop: 91%, oldest, max days: 7, recover: 5 days, 
check fs: 120 minutes
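
(For reference, the full staging resource can be printed with nsradmin:

   nsradmin
   nsradmin> print type: NSR stage
)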

We have a faulty tape drive, so we are currently running on one drive until the 
replacement arrives.
As the weekend backups were full backups, on Friday I forced staging by changing 
the thresholds, to ensure I had as much free space as possible for the weekend. 
This was to allow the backup jobs to be cloned to tape with minimal contention 
if staging kicked in.

Today I saw that the backups had hung because NetWorker was waiting on a stage 
tape that was 100% full.
When I investigated, I found that nsrstage was not running and the filesystems 
were within the threshold limits.
What was happening was that a job was trying to clone a save set that had just 
been created, but the save set was now on a stage tape.
It could not load the stage tape because the only drive already held the 
"destination" tape for the clone.

So why did the stage tape hold the most recent save sets?
I looked at the volumes on the filesystem, which seems to give some clues.

filesystem:  /diskbackup1

Total:   16TB,  1.8TB free (currently)

volume    size on disk    monthly backup size
notes     13T             6.8T
unix      17G             300G
linux     79M             400G


The most recently staged save sets were from the unix and linux volumes, and 
very few were from the notes volume. In fact, almost all of the unix and linux 
save sets have been staged.
Some of the largest save sets from notes are 500GB, so it's possible that all of 
the linux save sets can be staged in the time it takes to stage just one notes 
save set.
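
For what it's worth, a query along these lines can show what is still sitting on 
each disk volume, oldest first (treat the exact flags as a rough sketch):

   # Oldest-first list of save sets still on the "notes" disk volume;
   # repeat with volume=unix and volume=linux
   mminfo -avot -q "volume=notes" -r "savetime,sumsize,client,name,ssid"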

I expected the staging policy to compile a list of the oldest save sets across 
all devices and then move them to tape, which would mean that the most recent 
would still be on disk. So it seems that staging is selecting save sets in a 
different manner.
Could it be that it treats each volume separately?
Could it be looking for the oldest save sets in each volume and starting to 
stage them, and then, while still writing the larger notes save sets, going back 
to the other volumes and taking save sets from them because the notes volume is 
still busy with such a large save set?


Another possibility is that the next filesystem check starts while staging is 
still running; it cannot read from the notes volume because that is still in 
use, so it takes from the unix/linux volumes instead and just keeps taking until 
it meets the required threshold?


Thanks.

+----------------------------------------------------------------------
|This was sent by tam.mclaughlin AT gmail DOT com via Backup Central.
|Forward SPAM to abuse AT backupcentral DOT com.
+----------------------------------------------------------------------