Veritas-bu

Re: [Veritas-bu] My Experiences with DataDomain Restorers

2008-11-08 12:01:26
Subject: Re: [Veritas-bu] My Experiences with DataDomain Restorers
From: Eliza Yam <eyam AT hotmail DOT com>
To: "JAJA (Jamie Jamison)" <jamisonj AT zgi DOT com>, <robin.small AT fresno DOT gov>
Date: Sat, 8 Nov 2008 16:42:34 +0000
Jamie,
I can echo your comments on Data Domain Restorers, as we have used them for a few years now.  Thanks for the comprehensive analysis.  It will certainly help people who are looking for dedup appliances.
 
We also have a DDR460 at our DR site.  We recently bought a DD565 due to capacity growth, so the DDR460 is now waiting to be redeployed.  We have had many disk drive failures in the DDR460.  When drives failed, we replaced them without any problem.  After reading your report, what concerns me now is the midplane board.  Do you by any chance know the number of the support bulletin?
 
Thanks
Eliza
 
 


Date: Fri, 7 Nov 2008 18:08:47 -0800
From: jamisonj AT zgi DOT com
To: Robin.Small AT fresno DOT gov
CC: VERITAS-BU AT mailman.eng.auburn DOT edu
Subject: [Veritas-bu] My Experiences with DataDomain Restorers


I've been using DataDomain restorers, the 460 and 565 series, for almost three years now and here's my opinion of them, good and bad.
 
Good:
 
DeDuplication - DataDomain's de-duplication claims are accurate and the deduplication performance is impressive. I'm seeing compression ratios of 4:1 (ORACLE policy type backups) to 17:1 (general filesystem backups of systems with both STANDARD and MS-WINDOWS-NT policy types). For hot catalog backups I'm seeing a compression ratio of 77:1.
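
To put those ratios in perspective, here is a rough arithmetic sketch of what they mean for disk consumption. It reads the filesystem figure as 17:1, and the 1000 GB logical backup size is a made-up number for illustration, not something measured on these systems.

```python
def stored_size_gb(logical_gb: float, ratio: float) -> float:
    """Physical space used when logical_gb of backups reduce at ratio:1."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return logical_gb / ratio


def space_saved_percent(ratio: float) -> float:
    """Percentage of the logical data that never has to be stored."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return (1.0 - 1.0 / ratio) * 100.0


# Ratios quoted in the post (filesystem read as 17:1, an assumption).
RATIOS = {"ORACLE": 4.0, "filesystem": 17.0, "hot catalog": 77.0}

if __name__ == "__main__":
    # 1000 GB is a hypothetical logical size, chosen only for illustration.
    for name, ratio in RATIOS.items():
        print(f"{name}: {stored_size_gb(1000, ratio):.1f} GB stored, "
              f"{space_saved_percent(ratio):.1f}% saved")
```

The gap between 4:1 and 77:1 looks huge as a ratio, but in terms of space saved it is the difference between roughly three quarters and nearly all of the logical data.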
 
Replication - Replication also works well, which makes implementing a DR plan for your backup system much easier. I write hot catalog backups and my DR info to a DSU on my primary restorer once a day. Implementing a DR plan for NetBackup becomes a lot easier with this kind of technology because it takes care of replicating all of your backup data and your catalogs and DR info to your remote site.
 
Performance - Backup performance is very good and restores are wicked fast. Before I got my restorers set up I was running a StorageTek L180 library with eight LTO2 tape drives 24x7. I ran my primary backups and then duplicated them to tape and kept the duplicated copies onsite with a three week retention period. I was very close to running out of slots in the library for backup tapes and onsite duplication, and I was getting to a point where having even one drive go down was seriously impacting my schedule. Installing two 460 series restorers, with two at the DR site for replication, solved this problem. I write primary backups to the restorer with a one month retention period and then duplicate them to tape with a longer retention period for offsite vaulting. There are still performance issues related to the number of streams that the restorer can handle: if you have streams open for reading data from the restorer for duplication to tape then you can't have as many write streams open for backups. These issues are minor, though. The most I've had to do is stop a running duplication job during the backup window and then let duplication catch up once the backups are done. Being able to restore from a DSU with a long retention period is awesome; it's like having NetApp snapshot restores for all of your data. The Oracle and Exchange administrators I work with love this. Installing the two DataDomain restorers allowed me to hold off on upgrading my tape library for eighteen months.
 
Field engineering support - The DataDomain field engineers I have worked with are knowledgeable, efficient and friendly. They really know the equipment and know the ins and outs of NetBackup and how best to configure it for use with the DataDomain equipment. DataDomain contracts out their routine technical support to Glasshouse Technologies, who have been OK so far.
 
Ease of installation and configuration - Configuring a restorer takes about 15 minutes. There is a menu driven configuration utility at the CLI that runs you through all of the steps and once that's done you mount the restorer filesystems as NFS volumes or CIFS shares on your NetBackup master or media server, configure these filesystems as disk storage units and start using the system. I have not used the Open Storage Option yet but am looking forward to it. Really the hardest part about configuring a restorer is getting it into a rack.
 
User Interface - The GUI is very good and the CLI is superb. You can have multiple CLI sessions via ssh, and the CLI supports tab completion and command line history. If you enter a command without any arguments, it tells you what the possible arguments are. There's a CS term to describe this but I don't know what it is. As an example, if you want to see all of the arguments to the command "replication", you type "replication" at the command line. One of the arguments for replication is "show". If you want to see all of the arguments for "replication show", you type "replication show" at the command line and it shows you "replication show history", "replication show config", "replication show performance" and "replication show stats". I'm a CLI guy and I love that I can quickly check on the status of the system by connecting to it with ssh and running a handful of commands instead of having to, as you do with a NetApp filer, connect with a web interface and put up with a GUI because the CLI is crippled. The online documentation in the CLI is also superb, with the help system showing good and relevant examples for each command. I've rarely had to RTFM with my restorers.
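
For anyone who wants to script that kind of quick status check, here is a minimal sketch of running a couple of the commands named above over ssh. The user name and hostname in the usage note are hypothetical, key-based ssh login is assumed, and nothing is assumed about the format of the output, which is simply captured as text.

```python
import subprocess
from typing import Dict, List

# Commands taken from the post. Which ones are worth polling is a judgment
# call; this list is only an example.
STATUS_COMMANDS = [
    "replication show config",
    "replication show stats",
]


def build_ssh_command(user: str, host: str, remote_command: str) -> List[str]:
    """Return the argv list for running one CLI command on the restorer."""
    return ["ssh", f"{user}@{host}", remote_command]


def run_status_checks(user: str, host: str, timeout: int = 30) -> Dict[str, str]:
    """Run each status command over ssh and collect its raw text output."""
    results = {}
    for remote_command in STATUS_COMMANDS:
        completed = subprocess.run(
            build_ssh_command(user, host, remote_command),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        # Keep stderr when the command fails so the failure is visible.
        if completed.returncode == 0:
            results[remote_command] = completed.stdout
        else:
            results[remote_command] = completed.stderr
    return results
```

Calling run_status_checks("sysadmin", "restorer1.example.com") (both values hypothetical) returns a dictionary keyed by command, which is easy to drop into a morning status e-mail.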
 
Bad:
 
FLAMING RESTORERS OF DEATH! - Last year we had one of our DDR460 restorers catch fire. Well, actually it didn't catch fire. According to the DataDomain tech support people the restorers are built from UL listed fire resistant materials, so what actually happened is that the system midplane that the disk drives are connected to developed a short circuit, heated to 950 degrees Celsius and melted. I found out about this when I didn't get my morning status e-mail from the restorer in question. I tried pinging it and getting on the console (it was the system at our DR site) and while I was doing so my boss called and asked if I'd checked the equipment at the DR site, because he'd gotten a call from the folks who manage it who said that the machine room was smoky and that it smelled like a piece of electrical equipment had caught fire. It turned out that we were the culprits and that one of our 460 series restorers had melted down. That afternoon I got an e-mail from DataDomain with a technical support bulletin that said "Oh by the way, if you have a 460 series restorer and the serial number on the midplane is such and such please contact us so we can schedule an engineer to come out and replace it because there's a minor risk that the system could short circuit and let the magic smoke out." DataDomain did replace the restorer and I was lucky, as it was the replication target at our DR site and not the primary that I stored my backup images on, but I was really nervous until all of the system midplanes had been replaced on our DDR460 restorers. Apparently this replacement wasn't enough, or DataDomain wasn't comfortable with it, as they issued another support bulletin for the 460 series restorers and we had to have all of the midplane boards replaced again this year.
I've been a systems administrator for 20 years and worked in a variety of environments with a whole bunch of different gear and this was the first time that I'd ever had the magic smoke escape from a piece of equipment, and let me tell you, that sucker was melted. The damage was contained inside of the case but there was no way we could have salvaged any of the disks even if we had wanted to. Again I'm glad that it was the one at the DR site, which only contained replicated backup images and not one of my primary restorers.
 
Flaky code - DDOS, the Linux based operating system that the restorers run, is still very much a work in progress. DataDomain releases a major upgrade containing bug fixes for DDOS about every three months. You're pretty much stuck with installing these upgrades as they often contain code fixes necessary to support new SATA disk drive firmware revisions. The upgrade process is quick and easy enough but it's still a PITA because if I upgrade a piece of equipment in the backup system I need to test and document restoring backup images from before the upgrade and backups and restores after the upgrade (and I would do this even if it weren't part of the SOP for my backup system. It's not that I'm paranoid, it's just that I firmly believe in Murphy's law).
 
The 460 and 565 series restorers use consumer grade Hitachi or Seagate SATA drives, no different than what you would purchase from Fry's. A few weeks back I had a drive fail on my 565 series restorer. DataDomain spotted the failed drive in the daily autosupport and sent me a new drive without me having to do anything. The new drive didn't work; it was the same part number and firmware revision as the drive it replaced but it was from Seagate's Thailand facility, which has been notorious as of late for shipping batches of bad drives. So I requested another drive. The new drive came in; it was a Seagate, with a different firmware revision and date code. I installed it and still didn't get any love from the system. I called up DataDomain and said "what's up" and they told me that they had discovered a bug in DDOS that prevented a failed drive from being replaced if you had the letters "DDR" anywhere in the hostname of a DataDomain restorer. My restorer hostnames all begin with "DDR" (What was I thinking?). So in order to replace this drive I had to temporarily change the hostname with the command "net set hostname", type "yes" when the system said "Hey, changing your hostname affects replication at the source and target and will require the use of the 'repl modify' command", unfail the drive with the "disk unfail" command and then change the hostname back to what it was originally. Right after I got done with that I received an alert from the restorer saying that the drive wasn't a qualified drive model. The drive's part number is the same as the other drives, but the firmware revision is newer. Fun times.
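
The workaround above can be sketched as a short checklist generator. This is only an illustration: the case-insensitive match on "DDR", the argument placement after "net set hostname", and the `<disk-id>` placeholder for "disk unfail" are all assumptions, since the exact argument syntax isn't spelled out here. It prints the steps rather than running anything against a restorer.

```python
from typing import List


def hostname_triggers_bug(hostname: str) -> bool:
    """True if the hostname contains the letters "DDR".

    Matching case-insensitively is an assumption; the bug was only
    described as 'the letters "DDR" anywhere in the hostname'.
    """
    return "ddr" in hostname.lower()


def workaround_steps(original: str, temporary: str) -> List[str]:
    """List the CLI steps for unfailing a drive on an affected restorer."""
    if hostname_triggers_bug(temporary):
        raise ValueError("temporary hostname must not contain 'DDR'")
    return [
        # Argument placement is assumed; the CLI asks for a "yes"
        # confirmation because replication is affected.
        f"net set hostname {temporary}",
        # <disk-id> is a placeholder; the real argument format is not
        # given in the post.
        "disk unfail <disk-id>",
        f"net set hostname {original}",
    ]


if __name__ == "__main__":
    # Both hostnames here are hypothetical.
    for step in workaround_steps("DDR-primary", "tmp-restorer"):
        print(step)
```

The point of the check on the temporary name is simply to avoid renaming the box to something that still trips the same bug.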
 
I have had problems similar to the one above with my restorers since the day I first powered one on. One of my restorers will mark a drive as failed and half of the time when I call it in I get some bored tech who tells me to remove the drive and then reseat it and see if it still shows up as failed. I shut one of my systems down last month (so I could have the midplane board replaced) and when it came back up it showed as having two failed drives. I called DataDomain and they told me to power the system off, reseat the drives and then power the system back on and use the "disk unfail" command to unfail the disks. This is complete and total BS and it angers me every time they tell me this. If I have a bad drive on a NetApp filer or a Sun/STK RAID and call tech support for either NetApp or Sun they don't tell me to reseat the drive, cycle the power, dance the hokey pokey or put on my ruby slippers, click my heels together three times and type "disk unfail, disk unfail, disk unfail" at the CLI, they just replace the bloody drive, no questions asked. I have complained to DataDomain about this every time it happens and have been told that the next release will fix the problems with checking drive status, really, it will, and the check is in the mail and DataDomain will respect me in the morning too.
 
I have to say that I am completely and totally gobsmacked by this latest bug. I cannot imagine any reason why the system hostname should in any way, shape or form have anything to do with the code for checking, changing and controlling drive status in the RAID. I consider the bugs in DataDomain's disk status monitoring to be a huge problem with their equipment; they give me pause and make me nervous about the data I have stored on these systems. While DataDomain's restorers are one of the less expensive de-duplication solutions on the market, the fact remains that they're still expensive. DataDomain is claiming that they have an enterprise grade solution, and they certainly have an enterprise grade price, but this kind of thing is not enterprise grade reliability.
 
I hope this helps.
 
Jamie Jamison
Network Systems Administrator
ZymoGenetics, Seattle


_______________________________________________
Veritas-bu maillist  -  Veritas-bu AT mailman.eng.auburn DOT edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu