NDMP backup vs restore performance

Harry_Redl

ADSM.ORG Moderator
Hello all,

doing a few NetApp vs TSM NDMP tests.
TSM v5.3.6.0 (will be upgraded, but I do not think that is relevant now)
NetApp FAS3040 (IBM n5300)
IBM LTO4 drives in TS3310 - connected via FC to filer

What seems strange to me is that backing up one volume (300 GB) takes roughly 3 hrs (without a TOC - I do not need one).
Now I am doing a restore to an alternate volume on the same filer (full+diff).
After almost 2 hrs the process is at 2.8 GB, so the estimate is approx. 8 days.
Load on the filer is CPU <20%, disk <50% (while the restore is running).
Using NDMP v4.

commands used for backup:

backup node netapp1 /vol/path toc=no mode=full (one day old)
backup node netapp1 /vol/path toc=no mode=diff (current)

command used for restore:
restore node netapp1 /vol/path /vol/newpath
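For what it's worth, the 8-day figure follows from a simple linear projection of the rate seen so far; a minimal Python sketch, using only the numbers from this post (the function name is mine):

```python
# Naive linear projection of total restore time, using the numbers
# from the post: 2.8 GB restored after ~2 hours of a ~300 GB volume.
def eta_days(restored_gb: float, elapsed_hours: float, total_gb: float) -> float:
    rate_gb_per_hour = restored_gb / elapsed_hours
    remaining_hours = (total_gb - restored_gb) / rate_gb_per_hour
    return (elapsed_hours + remaining_hours) / 24.0

print(round(eta_days(2.8, 2.0, 300.0), 1))  # roughly 8.9 days at the initial rate
```

Of course, this assumes the rate stays constant, which (as it turned out) it does not.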

Any real-world comparison? Hint?

Thanks

Harry


*************
UPDATE
*************

the speed seems to be very far from constant ...
After approx. 11 hrs I have about 90 GB restored and the current speed is roughly 15 GB/hr (it started at 1 GB/hr).
The filer's statistics are almost the same:
CPU <35%, disk <55%

Does everyone have the same experience?
 
Since the restore and backup are REALLY handled by the NetApp filer, I would look at the filer itself for the slow progress. When it comes to NDMP tasks, TSM is just a handler of tapes and hands over all backup and restore processing once the commands are issued. Sounds like something is screwy on the filer.
 
Hi Chad,

yes - IMHO it must be the filer. The restore ended after roughly 30 hrs with a total of 340 GB. Monitoring the filer shows no heavy load (throughput, disk, CPU) - at least it seems so.
The bad thing is I am unable to find out what else was running against the filer - sometimes even the customer does not know :)
Maybe the problem is that the filer's used space is >90%, which is really not good for WAFL; maybe it was slowed down by daytime operations; maybe NDMP v3 can solve the problem ... will try and report.

Just wanted others' real-world numbers to compare against.

Thanks

Harry
 
Hi all again,

updating the info:

I tried restoring to another filer (slower, but dedicated disks with plenty of space) and tried NDMP v3.
The results are not good - see the restore graph.
Does anyone know what the filer does during the "flat" area of the graph? Not a single byte is transferred in 10 hrs.

Do you see the same results?

Thanks

Harry
 

Attachment: graf_thumb.gif (restore progress graph)
Harry,
You could check the filer performance by performing an NDMP dump from the source filer to the destination filer. This would take TSM and the tape media out of the picture and give you a performance baseline.
I am able to restore 100 GB LUNs in a couple of hours from TSM. Are you using aggregates with lots of disks? The number of spindles you are working with has a significant impact on performance.

Cheers,
Neil
 
Hi all,

thanks Neil - your suggestion about using ndmpcopy was a good one. It did not solve the problem, but it is now obvious where the problem is.
TSM is out of the picture now - I got very similar results using a filer-to-filer NDMP transfer.
The thing is that an NDMP restore runs in multiple phases - three are important here:
1) read the data from the dump needed for building the filesystem structure
2) build the filesystem structure
3) read the data from the dump to populate the filesystem

And the catch is phase 2.
My filesystem has approx. 50 million files - building that structure takes more than 14 hrs. And if you do a FULL+DIFF restore you go through it three times (as it seems from the graph).

Here are excerpts from the ndmpcopy logs that explain it all:


netapp1*> ndmpcopy -sa root:xxxxxx -da root:xxxxxx /vol/thumb netapp2:/vol/testthumb
Ndmpcopy: Starting copy [ 1 ] ...
Ndmpcopy: netapp1: Notify: Connection established
Ndmpcopy: netapp2: Notify: Connection established
Ndmpcopy: netapp1: Connect: Authentication successful
Ndmpcopy: netapp2: Connect: Authentication successful
Ndmpcopy: netapp1: Log: DUMP: creating "/vol/thumb/../snapshot_for_backup.54" snapshot.
Ndmpcopy: netapp1: Log: DUMP: Using Full Volume Dump
Ndmpcopy: netapp1: Log: DUMP: Date of this level 0 dump: Sat May 3 13:44:11 2008.
Ndmpcopy: netapp1: Log: DUMP: Date of last level 0 dump: the epoch.
Ndmpcopy: netapp1: Log: DUMP: Dumping /vol/thumb to NDMP connection
Ndmpcopy: netapp1: Log: DUMP: mapping (Pass I)[regular files]
Ndmpcopy: netapp1: Log: DUMP: mapping (Pass II)[directories]
Ndmpcopy: netapp1: Log: DUMP: estimated 349602751 KB.
Ndmpcopy: netapp1: Log: DUMP: dumping (Pass III) [directories]
Ndmpcopy: netapp2: Log: RESTORE: Sat May 3 14:05:34 2008: Begin level 0 restore
Ndmpcopy: netapp2: Log: RESTORE: Sat May 3 14:05:34 2008: Reading directories from the backup
...
...
Ndmpcopy: netapp1: Log: DUMP: Sat May 3 21:10:19 2008 : We have written 10091858 KB.
Ndmpcopy: netapp2: Log: RESTORE: Sat May 3 21:10:19 2008 : We have read 10090500 KB from the backup.
Ndmpcopy: netapp1: Log: DUMP: dumping (Pass IV) [regular files]
Ndmpcopy: netapp2: Log: RESTORE: Sat May 3 21:14:23 2008: Creating files and directories.
Ndmpcopy: netapp2: Log: RESTORE: Sat May 3 21:15:19 2008 : We have created 42719 files and directories.
Ndmpcopy: netapp2: Log: RESTORE: Sat May 3 21:20:19 2008 : We have created 401587 files and directories.
...
...
Ndmpcopy: netapp2: Log: RESTORE: Sun May 4 18:20:19 2008 : We have created 50200361 files and directories.
Ndmpcopy: netapp2: Log: RESTORE: Sun May 4 18:25:19 2008 : We have created 50295702 files and directories.
Ndmpcopy: netapp2: Log: RESTORE: Sun May 4 18:30:19 2008 : We have created 50369583 files and directories.
Ndmpcopy: netapp2: Log: RESTORE: Sun May 4 18:44:35 2008: Writing data to files.
Ndmpcopy: netapp2: Log: RESTORE: Sun May 4 18:44:35 2008 : We have read 10169239 KB from the backup.
Ndmpcopy: netapp1: Log: DUMP: Sun May 4 18:44:35 2008 : We have written 10170649 KB.
Ndmpcopy: netapp1: Log: DUMP: Sun May 4 18:49:35 2008 : We have written 15945551 KB.
Ndmpcopy: netapp2: Log: RESTORE: Sun May 4 18:49:35 2008 : We have read 15944285 KB from the backup.
...
...
Ndmpcopy: netapp1: Log: DUMP: Mon May 5 00:49:35 2008 : We have written 351736745 KB.
Ndmpcopy: netapp2: Log: RESTORE: Mon May 5 00:49:35 2008 : We have read 351736037 KB from the backup.
Ndmpcopy: netapp2: Log: RESTORE: Mon May 5 00:52:10 2008: Restoring NT ACLs.
Ndmpcopy: netapp1: Log: DUMP: dumping (Pass V) [ACLs]
Ndmpcopy: netapp1: Log: DUMP: 354542959 KB
Ndmpcopy: netapp1: Log: DUMP: DUMP IS DONE
Ndmpcopy: netapp1: Log: DUMP: Deleting "/vol/thumb/../snapshot_for_backup.54" snapshot.
Ndmpcopy: netapp2: Log: RESTORE: RESTORE IS DONE
Ndmpcopy: netapp2: Log: RESTORE: The destination path is /vol/testthumb/
Ndmpcopy: netapp2: Notify: restore successful
Ndmpcopy: netapp1: Notify: dump successful
Ndmpcopy: Transfer successful [ 1 days 11 hours 8 minutes 21 seconds ]
Ndmpcopy: Done
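The per-file cost of phase 2 can be pulled straight from two lines in the log above: the start of "Creating files and directories" and the last "We have created" counter. A small Python sketch, with the values transcribed from the log:

```python
from datetime import datetime

# Phase 2 ("Creating files and directories") rate, taken from the
# ndmpcopy log above: start of the phase vs. the last progress counter.
fmt = "%a %b %d %H:%M:%S %Y"
start = datetime.strptime("Sat May 03 21:14:23 2008", fmt)
end = datetime.strptime("Sun May 04 18:30:19 2008", fmt)
files_created = 50_369_583

elapsed_s = (end - start).total_seconds()
print(f"{files_created / elapsed_s:.0f} files/s")         # ~658 files/s
print(f"{elapsed_s / files_created * 1000:.2f} ms/file")  # ~1.52 ms/file
```

So the filer spent over 21 hours doing nothing but metadata creation, at roughly 1.5 ms per inode - which matches the "flat" area of the graph.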


I am asking IBM (as it is an IBM-branded NetApp) whether something can be done ...

Harry
 
Harry,
If you can, try to create large aggregates with as many disks as possible. The number of spinning disks available to write to has a significant impact on performance. We typically have one or two aggregates in a system, with each aggregate based on disk size.

Another alternative might be to try TSM virtual filespace mappings (DEFINE VIRTUALFSMAPPING) to let you back up/restore using multiple threads. I am currently dealing with LUNs in my NDMP environment.
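For reference, a sketch of what that might look like, splitting one large volume into two virtual filespaces that can be backed up (and restored) as separate NDMP sessions. The subdirectory and mapping names below are hypothetical; check the DEFINE VIRTUALFSMAPPING syntax in the TSM Administrator's Reference for your server level:

```
define virtualfsmapping netapp1 /vfs_a /vol/path /subdir_a
define virtualfsmapping netapp1 /vfs_b /vol/path /subdir_b
backup node netapp1 /vfs_a toc=no mode=full
backup node netapp1 /vfs_b toc=no mode=full
```

With enough drives, the two backup (or restore) sessions can run in parallel.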

We currently use 4 dedicated Windows boxes to back up about 50 million files on 2 clustered FAS3020s. We are able to back up everything in about 24 hours and will be adding a couple more dedicated backup boxes to get this down to 12 hours.

Cheers,
Neil
 
Hi Neil,

yes, we have only one aggregate for this disk type. The backup process is OK: a FULL is done in approx. 3.5 hrs and a DIFF takes roughly 50 minutes. It is the restore time that causes the problem.
Have you ever tried a FULL restore on your box? How long did it take? (Just to have a real-world comparison.)
We still have the "normal" progressive backup methodology in place - it seems good for individual file restores, but for DR it seems unusable. The estimate for a full DR using a filesystem-level restore (not NDMP) is approx. 14 days ...


Harry
 
Harry,
We mirror everything, and TSM is a secondary method of recovery. Fortunately we have not had to perform a full TSM recovery.
I have performed a few NDMP recoveries of NetApp volumes that contain LUNs, and these restore almost as quickly as they back up.

You might consider looking into a snapshot/SnapMirror solution to protect the data and then integrate TSM for long-term retention/archive. A really nice feature of this solution is that users can restore files from snapshots themselves; most restores are accomplished this way. One advantage of snapshots is that they are taken several times a day, so users are protected with an RPO of hours or minutes depending on the criticality of the data.
A few older files or directories are periodically restored from TSM. We exercise a full system failover to the mirror twice a year with little fanfare.

Cheers,
Neil
 
Hi,

cfhoffman: no, it is not the same - I want to restore the full volume, so I am reading all the data. The data is read rather quickly (5 hrs), but building the filesystem from that data on the NetApp side is slow.
Simple calculation - if every file must be "touched" and every single one takes a millisecond, then I have 50,000 secs just for this - roughly 14 hrs. And that is what I see.
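Harry's back-of-envelope arithmetic checks out; in Python:

```python
# 50 million files, ~1 ms of metadata work each: file creation alone
# accounts for most of the observed phase 2 time.
files = 50_000_000
seconds_per_file = 0.001
total_hours = files * seconds_per_file / 3600
print(round(total_hours, 1))  # about 13.9 hours
```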

Harry
 
Is your 3040 on SATA disks? We ran into similar issues with a 3070; once we moved from SATA to FC we saw (obvious) increases.

We were also running fulls+diffs, and the restore process took too long for the customer's requirements. We did not see much need for it anyway, as the full was restored and then the diff restored on top of it; it was quicker to just do fulls nightly and restore a single full in less time than a full + diff.

When watching the filers, we noticed similar numbers: the CPU did not seem to be doing much, but the disks just could not handle the IO.
 