Amanda-Users

Re: Troubleshooting a slowdown problem?

2004-08-13 16:34:44
Subject: Re: Troubleshooting a slowdown problem?
From: Frank Smith <fsmith AT hoovers DOT com>
To: KEVIN ZEMBOWER <KZEMBOWE AT jhuccp DOT org>, amanda-users AT amanda DOT org
Date: Thu, 17 Jun 2004 14:54:44 -0500
--On Thursday, June 17, 2004 12:02:29 -0400 KEVIN ZEMBOWER <KZEMBOWE AT jhuccp DOT org> wrote:

> A couple of months ago, I added a server, centernet, to my Amanda backups. 
> Since that time, about once a week or two, but not every day, the backup runs 
> for 16-20 hours, instead of its normal less than 8. It's running right now 
> since last night at 8:00pm:
> 
> amanda@admin:~ > amstatus DailySet1
> Using /var/log/amanda/DailySet1/amdump from Wed Jun 16 20:00:00 EDT 2004
> 
> admin://db/c$                       0   462720k finished (20:56:41)
> admin://db/e$                       1       10k finished (20:16:16)
> admin://db/f$                       1  2559660k finished (20:51:25)
> admin://db/f$/inetsrv/webpub/images 1       30k finished (20:16:07)
> admin:sda1                          0     3410k finished (20:17:09)
> admin:sda3                          0  3683924k wait for dumping 
> admin:sdb1                          0    24270k finished (20:17:05)
> centernet:sda1                      0     4846k finished (20:06:03)
> centernet:sda2                      0   715715k finished (2:15:47)
> centernet:sda3                      0   110883k finished (20:58:42)
> centernet:sda5                      0  1812031k dumping  1332992k ( 73.56%) (2:02:34)
> centernet:sda6                      1       73k finished (20:03:37)
> centernet:sda7                      0      564k finished (20:04:03)
> centernet:sda9                      0    30391k finished (20:18:59)
> mailinglists:hda1                   0     2198k finished (20:04:30)
> mailinglists:hda2                   0   399339k finished (23:03:41)
> mailinglists:hda7                   0   743827k finished (4:41:56)
> 
> SUMMARY          part      real  estimated
>                            size       size
> partition       :  17
> estimated       :  17             11566387k
> flush           :   0         0k
> failed          :   0                    0k           (  0.00%)
> wait for dumping:   1              3683924k           ( 31.85%)
> dumping to tape :   0                    0k           (  0.00%)
> dumping         :   1   1332992k   1812031k ( 73.56%) ( 11.52%)
> dumped          :  15   5057936k   6070432k ( 83.32%) ( 43.73%)
> wait for writing:   0         0k         0k (  0.00%) (  0.00%)
> wait to flush   :   0         0k         0k (100.00%) (  0.00%)
> writing to tape :   0         0k         0k (  0.00%) (  0.00%)
> failed to tape  :   0         0k         0k (  0.00%) (  0.00%)
> taped           :  15   5057936k   6070432k ( 83.32%) ( 43.73%)
> 7 dumpers idle  : no-hold
> taper idle
> network free kps:     26362
> holding space   :  34661948k ( 95.03%)
>  dumper0 busy   :  8:13:03  ( 95.12%)
>  dumper1 busy   :  6:24:10  ( 74.11%)
>  dumper2 busy   :  2:52:40  ( 33.31%)
>    taper busy   :  1:07:38  ( 13.05%)
>  0 dumpers busy :  0:00:00  (  0.00%)
>  1 dumper busy  :  0:13:42  (  2.64%)             no-hold:  0:13:42  (100.00%)
>  2 dumpers busy :  7:57:50  ( 92.18%)  client-constrained:  5:31:58  ( 69.47%)
>                                                   no-hold:  2:25:40  ( 30.49%)
>                                                start-wait:  0:00:11  (  0.04%)
>  3 dumpers busy :  0:26:50  (  5.18%)  client-constrained:  0:26:47  ( 99.77%)
>                                                start-wait:  0:00:03  (  0.23%)
> amanda@admin:~ > 
> 
> What can be determined from this status regarding the reasons for the backup 
> of centernet:sda5 to be so slow? Actually, I guess it was either 
> centernet:sda1 or centernet:sda3 which took over 20 hours (am I reading this 
> correctly, or is it 20 minutes,
> or is this a time-of-day?).

I think the amstatus times are time-of-day, but I've been wrong before.  It's
much easier (for me anyway) to debug problems with the daily report of the
completed run instead of an amstatus from the middle of it.
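
If the times really are time-of-day, you can turn them into elapsed time by
subtracting the run start. A minimal sketch, assuming the run started at
20:00:00 (the amdump timestamp in your output) and has been going for less
than 24 hours; the function name is only for illustration:

```python
from datetime import datetime, timedelta

def elapsed_since_start(start_hms: str, finish_hms: str) -> timedelta:
    """Elapsed time from run start to a finish time-of-day.

    Assumes both values are wall-clock times and the run is shorter
    than 24 hours, so a finish earlier than the start means "next day".
    """
    fmt = "%H:%M:%S"
    start = datetime.strptime(start_hms, fmt)
    finish = datetime.strptime(finish_hms, fmt)
    if finish < start:
        finish += timedelta(days=1)  # wrapped past midnight
    return finish - start

# Finish times copied from the amstatus output quoted above.
for disk, finished in [
    ("centernet:sda1", "20:06:03"),
    ("centernet:sda2", "2:15:47"),
    ("mailinglists:hda7", "4:41:56"),
]:
    print(disk, elapsed_since_start("20:00:00", finished))
```

Read that way, centernet:sda1 finished about six minutes into the run, not
after 20 hours.
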
   It looks as though much of your backup time is spent 'client constrained'
or 'no holding disk'.  The 'no hold' can be a real killer: it means your
dump is going direct to tape, and if the dump isn't as fast as the tape (it
usually isn't) then the tape drive has to constantly stop, reposition itself,
write, stop, reposition, etc., seriously degrading throughput. Also, Amanda
can only do one direct-to-tape dump at a time instead of running multiple
dumps in parallel.  Since you don't seem to have enough holding disk, Amanda
can't dump other filesystems in the meantime, either. How big is your holding
disk?
   The 'client constrained' could be a slow network, or client compression on
a slow box, or a few other things.  The daily report will give you a better idea
of where the time is spent.
   You're backing up less than 12GB; it really shouldn't be taking that long.
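
As a rough sanity check, here is the arithmetic using only numbers from your
own output: the estimated total from the amstatus summary and the nominal
speed in your Python-DDS3 tapetype. It assumes the tape streams at that rate
the whole time, which it won't, so treat the result as a lower bound rather
than a prediction:

```python
ESTIMATED_KB = 11_566_387  # "estimated" total from the amstatus SUMMARY
TAPE_KB_PER_SEC = 1078     # "speed 1078 kps" from the Python-DDS3 tapetype

def hours_at_rate(kbytes: float, kb_per_sec: float) -> float:
    """Hours needed to move `kbytes` at a constant `kb_per_sec`."""
    return kbytes / kb_per_sec / 3600

tape_hours = hours_at_rate(ESTIMATED_KB, TAPE_KB_PER_SEC)
print(f"{tape_hours:.1f} hours at full tape speed")
```

That comes to roughly three hours, so a 16-20 hour run means the time is
going somewhere other than the tape drive itself.
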
   
> 
> ps on centernet doesn't show anything abnormal:
> cn2:~# ps aux |grep amanda
> amanda   27333  0.0  0.2  1680  636 ?        S    02:01   0:00 /usr/local/libexec/sendbackup
> amanda   27335  1.9  0.2  1596  600 ?        S    02:01  11:08 /bin/gzip --fast
> amanda   27336  0.0  0.1  1904  364 ?        S    02:01   0:00 dump 0usf 1048576 - /dev/sda5
> amanda   27337  0.0  0.2  1956  664 ?        S    02:01   0:07 dump 0usf 1048576 - /dev/sda5
> amanda   27338  0.0  0.1  1904  488 ?        S    02:01   0:12 dump 0usf 1048576 - /dev/sda5
> amanda   27339  0.0  0.1  1904  504 ?        S    02:01   0:11 dump 0usf 1048576 - /dev/sda5
> amanda   27340  0.0  0.1  1904  480 ?        S    02:01   0:12 dump 0usf 1048576 - /dev/sda5
> root     29630  0.0  0.1  1336  436 pts/1    S    11:43   0:00 grep amanda
> cn2:~# 
> 
> And the partitions on this host aren't outrageous:
> cn2:~# df -h
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/sda2             7.3G  1.5G  5.5G  21% /
> /dev/sda1             7.6M  5.6M  1.6M  78% /boot
> /dev/sda3             4.6G  291M  4.0G   7% /usr
> /dev/sda5             4.6G  2.0G  2.3G  46% /opt/analog/logdata
> /dev/sda6             3.7G  794M  2.7G  23% /var/www/centernet/htdocs
> /dev/sda7             3.7G  5.3M  3.4G   1% /var/lib/mysql
> /dev/sda9             2.8G  572M  2.0G  22% /var/www/centernet/logs
> cn2:~# 
> 
> Here's the relevant parts of disklist and amanda.config:
> amanda@admin:/etc/amanda/DailySet1 > grep centernet disklist
> centernet sda1 comp-user                # /boot
> centernet sda2 comp-user                # /
> centernet sda3 comp-user                # /usr
> centernet sda5 comp-user                # /opt/analog/logdata
> centernet sda6 comp-user                # /var/www/centernet/htdocs
> centernet sda7 comp-user                # /var/lib/mysql
> centernet sda9 comp-user                # /var/www/centernet/logs
> amanda@admin:/etc/amanda/DailySet1 > 
> 
> amanda@admin:/etc/amanda/DailySet1 > egrep -v "(^( |\t)*#|^$)" amanda.conf 
> org "JHU/CCP"           # your organization name for reports
> mailto "isgalert AT jhuccp DOT org"            # space separated list of operators at your site
> dumpuser "amanda"       # the user to run dumps under
> inparallel 8            # maximum dumpers that will run in parallel (max 63)
> dumporder "tttttttt"    # specify the priority order of each dumper
> netusage  25000 Kbps    # maximum net bandwidth for Amanda, in KB per sec
> dumpcycle 3             # the number of days in the normal dump cycle
> runspercycle 3          # the number of amdump runs in dumpcycle days
> tapecycle 25 tapes      # the number of tapes in rotation
> bumpsize 20 Mb          # minimum savings (threshold) to bump level 1 -> 2
> bumpdays 1              # minimum days at each level
> bumpmult 4              # threshold = bumpsize * bumpmult^(level-1)
> etimeout 300            # number of seconds per filesystem for estimates.
> dtimeout 1800           # number of idle seconds before a dump is aborted.
> ctimeout 30             # maximum number of seconds that amcheck waits
> tapebufs 20
> tapedev "/dev/nst0"     # the no-rewind tape device to be used
> rawtapedev "/dev/null"  # the raw device to be used (ftape only)
> tapetype Python-DDS3            # what kind of tape it is (see tapetypes below)
> labelstr "^DailySet1[0-9][0-9]*$"       # label constraint regex: all tapes must match
> holdingdisk hd1 {
>     comment "main holding disk"
>     directory "/var/amanda"     # where the holding disk is
>     use -0Mb            # how much space can we use on it. Use everything.
>     chunksize 1Gb       # size of chunk if you want big dump to be
>     }
> holdingdisk hd2 {
>     directory "/dumps2/amanda"
>     use -0 Mb
>     }
> reserve 50 # percent
> autoflush yes #
> infofile "/var/log/amanda/DailySet1/curinfo"    # database DIRECTORY
> logdir   "/var/log/amanda/DailySet1"            # log directory
> indexdir "/var/log/amanda/DailySet1/index"      # index directory
> define tapetype Python-DDS3 {
>     comment "Dell Python with DDS-3 tapes"
>     length 11570 mbytes
>     filemark 0 kbytes
>     speed 1078 kps 
>     lbl-templ "/usr/local/etc/amanda/DailySet1/3holeJHUCCP.ps"
> }
> define dumptype global {
>     comment "Global definitions"
> }
> define dumptype comp-user {
>     global
>     comment "Non-root partitions on reasonably fast machines"
>     compress client fast
>     priority medium
> }
> define interface local {
>     comment "a local disk"
>     use 1000 kbps
> }
> define interface le0 {
>     comment "10 Mbps ethernet"
>     use 400 kbps


This will probably keep you from backing up more than one remote filesystem
at a time, since if Amanda is using more than 50KB/sec (400kb/sec) it won't
start another dumper.

> }
> amanda@admin:/etc/amanda/DailySet1 > 
> 
> Thanks for any suggestions on what's happening, and how to fix it. Please let 
> me know if there's some other diagnostic I should run to further define this 
> problem.

When the run finishes, post your daily report and we can give you more specific
suggestions.

Frank

> 
> -Kevin Zembower
> 
> 
> -----
> E. Kevin Zembower
> Unix Administrator
> Johns Hopkins University/Center for Communications Programs
> 111 Market Place, Suite 310
> Baltimore, MD  21202
> 410-659-6139



-- 
Frank Smith                                      fsmith AT hoovers DOT com
Sr. Systems Administrator                       Voice: 512-374-4673
Hoover's Online                                   Fax: 512-374-4501

