Subject: Re: [Networker] Multiple drives for recovery?
From: George Sinclair <George.Sinclair AT NOAA DOT GOV>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Thu, 11 Mar 2010 13:53:42 -0500
Tim Mooney wrote:
> In regard to: [Networker] Multiple drives for recovery?, George Sinclair...:
>
>> Anyone else seen this?
>>
>> I run a browsable recover (CLI) requiring one full and three incremental tapes. NW loads all the tapes (6 drives in the tape library) and starts reading from each, rather than starting with the full and working through one tape at a time to the last incr?
>>
>> This is on 7.5SP1 on RH Linux. The client is running an older 7.2.2 release.

> It's funny that you should bring up something like this.  I've been
> meaning to post to the list about a major issue with parallelized
> multi-volume recovers.  I don't want to hijack your thread so I'll
> post more details in a different thread.

Well, this all started because I ran a browsable recover (CLI recover tool) last night from client A, pointing to an older 7.x server1. The browsetime that I used was from back in January 2010. The recover required 4 tapes (one full, three incrementals). It ran the conventional way by loading each tape one at a time. About 7.5 GB of data was recovered. Everything looked fine. The dates and times all looked consistent. I then created an MD5 hash listing of all the recovered data (file permissions, owner, group, checksum, file size, etc.).
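In case anyone wants to do the same kind of validation, the listing can be built with standard GNU tools (the server is RH Linux); the recovery directory and output file names below are just placeholders, not exactly what I used:

```shell
#!/bin/sh
# Build a validation manifest for a recovered tree: MD5 checksums for
# regular files, plus mode/owner/group/size/mtime for every entry, so
# directory timestamps can be checked as well as file contents.
# RECOVER_DIR is a placeholder for the actual recovery target.
RECOVER_DIR=${RECOVER_DIR:-.}

# Checksums (regular files only); -r keeps xargs from running
# md5sum with no arguments if the tree happens to be empty.
find "$RECOVER_DIR" -type f -print0 | xargs -0 -r md5sum > /tmp/recover.md5

# Metadata for files *and* directories (GNU find's -printf).
find "$RECOVER_DIR" -printf '%m %u %g %s %T@ %p\n' | sort -k6 > /tmp/recover.meta
```

Run the same thing after each recover and diff the two sets of output files.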

Next, I moved the tapes over to the new 7.5SP1 server2's tape library. I repeated the recovery from the same client A, and this time it loaded the tapes simultaneously and started reading from all of them, and I'm thinking: "What the heck!!!????". I didn't notice any error messages or overwrite prompts in the recover window, however. The recover completed and indicated the same number of recovered files. **BUT**, when I validated it against the MD5 hash listing from above, a number of directories had new time stamps - new as in the date of the recover, not their original mod times. Only directories had new mod times, and not all of them; some were fine. Otherwise, everything else was identical. I thought it was odd that the mod times for only *certain* directories were not preserved during the recovery, while all other files were perfect, as were a number of other directories.
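If anyone wants to reproduce that kind of check: given two metadata listings in a simple whitespace-separated format (mode, owner, group, size, mtime, path - one entry per line, no spaces in paths), awk can pull out the entries whose mod times changed between the two recovers. The file names and sample data here are illustrative, not what I actually used:

```shell
#!/bin/sh
# Create two sample listings: format is "mode owner group size mtime path".
# run1.meta/run2.meta stand in for the manifests of the two recovers.
printf '755 root root 4096 100 /data/dirA\n755 root root 4096 200 /data/dirB\n' > /tmp/run1.meta
printf '755 root root 4096 100 /data/dirA\n755 root root 4096 300 /data/dirB\n' > /tmp/run2.meta

# Print every path whose mtime (field 5) differs between the runs.
awk 'NR==FNR { mtime[$6] = $5; next }
     ($6 in mtime) && mtime[$6] != $5 { print $6 }' \
    /tmp/run1.meta /tmp/run2.meta
# Prints: /data/dirB
```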

So, today, I repeated the recovery, but I first disabled all but one of the drives to force NW to load tapes one at a time. Eureka! It worked like a champ, and everything validated!
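For what it's worth, if you'd rather script the drive disabling than click through the GUI, I believe nsradmin can update the device resource's enabled attribute. This is from memory, so verify the attribute names against your release; the device path is just a placeholder:

```shell
nsradmin -s server2
nsradmin> . type: NSR device; name: /dev/nst1
nsradmin> update enabled: No
nsradmin> quit
```

Repeat for each drive you want taken out of play, then re-enable them when the recover finishes.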

I should also note that before any of the recovers, I generated a hash listing of the CFI for client A on both servers, and they were identical except for the directory structure under /nsr/index/db6. Otherwise, the files were all the same, and there were the same number of them. Moreover, nsrinfo for the given date/time produces identical results on both servers.
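The nsrinfo comparison was just along these lines - server names, client name and time are placeholders, and the exact flags should be double-checked against the man page:

```shell
nsrinfo -s server1 -t "01/15/2010" clientA | sort > /tmp/idx1
nsrinfo -s server2 -t "01/15/2010" clientA | sort > /tmp/idx2
diff /tmp/idx1 /tmp/idx2
```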


> Yes, we've seen NetWorker parallelize multi-volume recovers.  Most of the
> time it works pretty well.  IIRC, this is something that was added in the
> 7.x series (earlier versions would always serialize volume access).  It
> used to be configurable by creating a file in /nsr/debug (do a substring
> search of the mailing list archives for striped_recover for more info).

I find it hard to believe that NW can utilize multiple drives. How does it merge and/or munge everything properly? What if you're instead recovering from multiple fulls? How can it temporarily store all that data without putting disk space in jeopardy at some point? And how does it organize and/or re-conglomerate all that later?

Granted, this is a feature that I've always thought would be nice as it would cut down restore times by many factors if it could recover in parallel, but again, this raises my questions above. I wasn't aware that this feature was ever developed and in use in later versions. I generally watch the GUI when doing restores, and I've always seen tapes loaded one by one on 7.2.x releases, but 7.5SP1 is all new to us.


> We have, however, seen a few instances where recover apparently deadlocks
> in the striped recovery code.  This happened to us a couple of times
> under 7.2.x or 7.4.x, but we upgraded to 7.5.2 last week and the first big
> recover we had to do triggered a deadlock in recovery.  We've had a case
> open with EMC about this issue since last Friday.

What do you mean by 'deadlocks'?

Do you think the parallel recovery would most likely explain the weirdness that we see?

I was thinking of trying to upgrade the client software on client A, but I doubt that has any effect on what the server decides to do on its end in terms of loading those tapes. Also, it seems unlikely that the index itself is somehow the culprit. Granted, it might know which tapes the data is located on, but the server is still gonna handle the loading. Moreover, the jukebox configuration and/or the tape library seems an unlikely suspect, as it just does what the server tells it.

If I have to disable drives during multi-tape restores that's gonna be a real pain. sigh ...

George


> Tim


--
George Sinclair
Voice: (301) 713-3284 x210
- The preceding message is personal and does not reflect any official or unofficial position of the United States Department of Commerce -
- Any opinions expressed in this message are NOT those of the US Govt. -

To sign off this list, send email to listserv AT listserv.temple DOT edu and type 
"signoff networker" in the body of the email. Please write to networker-request 
AT listserv.temple DOT edu if you have any problems with this list. You can access the 
archives at http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER
