Amanda-Users

planner timeouts

2005-08-30 21:05:24
Subject: planner timeouts
From: Charles Sprickman <spork AT bway DOT net>
To: amanda-users AT amanda DOT org
Date: Tue, 30 Aug 2005 20:43:09 -0400 (EDT)
Hi,

Recently we started having trouble getting backups from one of our servers. One day it was working, the next it wasn't. I've been digging around quite a bit and I'm not able to really see where the failure is. It appears that the planner process on the client machine times out, but I'm not sure why, and I'm seeing nothing in the debug logs on the client to indicate what the problem is. I've reviewed all the changes on the client from when it worked until when it didn't and haven't seen anything that involves amanda or networking. We do run firewalls on each host, and I've tested with them disabled.

This is a rather old version of Amanda and we will be updating as soon as we get the OS and other software up-to-date on all the hosts. Version is 2.4.3.

The server (devel2) is in one location (an office with dsl, no pppoe) and the clients are located in a datacenter. All of the other hosts on the same network are backing up with no problems. All other hosts are running the same version of amanda, and most are running the same version of FreeBSD (4.11). Some hosts have more data to backup, some have less.

All the timeouts in amanda.conf on the server have been bumped up:
ctimeout 32000                  # timeout for clients
dtimeout 16000                  # timeout for dump processes
etimeout 32000                  # timeout for dump estimates

The logs are lengthy, I apologize in advance. Below I have included the job output from the server, all client debug logs, all server debug logs, normal logs from the server, and a tcpdump of the session as seen from both hosts. The host "devel2" is the backup server, "h13" is the client that is failing.

----

job output:

FAILURE AND STRANGE DUMP SUMMARY:
  h13.blah /spool lev 0 FAILED [Request to h13.blah.com timed out.]
h13.blah /var/qmail/bin lev 0 FAILED [Request to h13.blah.com timed out.] h13.blah /var/qmail/control lev 0 FAILED [Request to h13.blah.com timed out.]
  h13.blah /var/db/pkg lev 0 FAILED [Request to h13.blah.com timed out.]
  h13.blah /usr/local/ lev 0 FAILED [Request to h13.blah.com timed out.]
  h13.blah /home lev 0 FAILED [Request to h13.blah.com timed out.]
  h13.blah / lev 0 FAILED [Request to h13.blah.com timed out.]

h13 (client) debug logs. Note that there is two-way communication, and everything seems to go correctly. In the debug dir, there are only "amandad" debug logs, nothing else.

amandad: time 0.000: got packet:
--------
Amanda 2.4 REQ HANDLE 000-80930808 SEQ 1125445583
SECURITY USER operator
SERVICE noop
OPTIONS features=fffffeff9f00;
--------

amandad: time 0.000: sending ack:
----
Amanda 2.4 ACK HANDLE 000-80930808 SEQ 1125445583
----

amandad: time 0.089: bsd security: remote host devel2 user operator local user operator
amandad: time 0.175: amandahosts security check passed
amandad: time 0.175: running service "noop"
amandad: time 0.176: sending REP packet:
----
Amanda 2.4 REP HANDLE 000-80930808 SEQ 1125445583
OPTIONS features=fffffeff9f00; ----

amandad: time 0.210: got packet:
----
Amanda 2.4 ACK HANDLE 000-80930808 SEQ 1125445583
----

amandad: time 0.210: pid 19416 finish time Tue Aug 30 19:46:24 2005

devel2 (server) debug logs.  Not sure if this is useful, but here it is:

amtrmidx.20050830194655.debug :

h13.blah.com /
rm /var/db/amanda/index/h13.blah.com/_/20050707_2.gz
rm /var/db/amanda/index/h13.blah.com/_/20050703_2.gz
h13.blah.com /home
rm /var/db/amanda/index/h13.blah.com/_home/20050707_6.gz
h13.blah.com /usr/local/
rm /var/db/amanda/index/h13.blah.com/_usr_local_/20050707_0.gz
rm /var/db/amanda/index/h13.blah.com/_usr_local_/20050703_2.gz
h13.blah.com /var/db/pkg
rm /var/db/amanda/index/h13.blah.com/_var_db_pkg/20050707_1.gz
rm /var/db/amanda/index/h13.blah.com/_var_db_pkg/20050703_0.gz
h13.blah.com /var/qmail/control
rm /var/db/amanda/index/h13.blah.com/_var_qmail_control/20050707_1.gz
rm /var/db/amanda/index/h13.blah.com/_var_qmail_control/20050703_1.gz
h13.blah.com /var/qmail/bin
rm /var/db/amanda/index/h13.blah.com/_var_qmail_bin/20050707_1.gz
rm /var/db/amanda/index/h13.blah.com/_var_qmail_bin/20050703_1.gz
h13.blah.com /spool

amtrmlog.20050830194654.debug:

amtrmlog: debug 1 pid 39378 ruid 2 euid 2: start at Tue Aug 30 19:46:54 2005
/usr/local/libexec/amanda/amtrmlog: version 2.4.3
Keeping 48 log files
amtrmlog: pid 39378 finish time Tue Aug 30 19:46:55 2005

standard log file. we do see the planner timeout here, but it's not clear to me why.:

/var/log/amanda/log.1:

dumper: dgram_bind: socket bound to 0.0.0.0.943
dumper: pid 39290 executable dumper2 version 2.4.3, using port 943
dumper: dgram_bind: socket bound to 0.0.0.0.945
dumper: pid 39292 executable dumper4 version 2.4.3, using port 945
dumper: dgram_bind: socket bound to 0.0.0.0.942
dumper: pid 39289 executable dumper1 version 2.4.3, using port 942
dumper: dgram_bind: socket bound to 0.0.0.0.944
dumper: pid 39291 executable dumper3 version 2.4.3, using port 944
planner: time 0.028: dgram_bind: socket bound to 0.0.0.0.937
READING CONF FILES...
dumper: dgram_bind: socket bound to 0.0.0.0.941
dumper: pid 39288 executable dumper0 version 2.4.3, using port 941
planner: time 0.052: startup took 0.052 secs

SETTING UP FOR ESTIMATES...
planner: time 0.052: setting up estimates for h13.blah.com:/
setup_estimate: h13.blah.com:/: command 0, options:
    last_level 1 next_level0 6 level_days 1
    getting estimates 0 (47219) 1 (88) -1 (-1)
planner: time 0.076: setting up estimates for h13.blah.com:/home
h13.blah.com:/home overdue 19 days for level 0
setup_estimate: h13.blah.com:/home: command 0, options:
    last_level 7 next_level0 -19 level_days 2
    getting estimates 0 (15061108) 7 (4581318) 8 (0)
planner: time 0.088: setting up estimates for h13.blah.com:/usr/local/
setup_estimate: h13.blah.com:/usr/local/: command 0, options:
    last_level 1 next_level0 6 level_days 1
    getting estimates 0 (238435) 1 (93) -1 (-1)
planner: time 0.093: setting up estimates for h13.blah.com:/var/db/pkg
setup_estimate: h13.blah.com:/var/db/pkg: command 0, options:
    last_level 1 next_level0 4 level_days 2
    getting estimates 0 (3550) 1 (80) 2 (0)
planner: time 0.111: setting up estimates for h13.blah.com:/var/qmail/controlsetup_estimate: h13.blah.com:/var/qmail/control: command 0, options:
    last_level 1 next_level0 6 level_days 1
    getting estimates 0 (30) 1 (10) -1 (-1)
setup_estimate: h13.blah.com:/var/qmail/bin: command 0, options:
    last_level 1 next_level0 6 level_days 1
    getting estimates 0 (660) 1 (10) -1 (-1)
planner: time 0.131: setting up estimates for h13.blah.com:/spool
h13.blah.com:/spool overdue 3 days for level 0
setup_estimate: h13.blah.com:/spool: command 0, options:
    last_level 3 next_level0 -3 level_days 2
    getting estimates 0 (3841501) 3 (4172012) 4 (0)
planner: time 0.144: setting up estimates took 0.091 secs

GETTING ESTIMATES...
planner: time 30.956: error result for host h13.blah.com disk /spool: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /var/qmail/bin: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /var/qmail/control: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /var/db/pkg: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /usr/local/: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /home: Request to h13.blah.com timed out. planner: time 30.956: error result for host h13.blah.com disk /: Request to h13.blah.com timed out.
planner: time 30.956: getting estimates took 30.811 secs
FAILED QUEUE:
  0: h13.blah.com /spool
  1: h13.blah.com /var/qmail/bin
  2: h13.blah.com /var/qmail/control
  3: h13.blah.com /var/db/pkg
  4: h13.blah.com /usr/local/
  5: h13.blah.com /home
  6: h13.blah.com /
DONE QUEUE: empty

ANALYZING ESTIMATES...
planner: FAILED h13.blah.com /spool 20050830 0 [Request to h13.blah.com timed out.] planner: FAILED h13.blah.com /var/qmail/bin 20050830 0 [Request to h13.blah.com timed out.] planner: FAILED h13.blah.com /var/qmail/control 20050830 0 [Request to h13.blah.com timed out.] planner: FAILED h13.blah.com /var/db/pkg 20050830 0 [Request to h13.blah.com timed out.] planner: FAILED h13.blah.com /usr/local/ 20050830 0 [Request to h13.blah.com timed out.] planner: FAILED h13.blah.com /home 20050830 0 [Request to h13.blah.com timed out.] planner: FAILED h13.blah.com / 20050830 0 [Request to h13.blah.com timed out.]
INITIAL SCHEDULE (size 2064):

DELAYING DUMPS IF NEEDED, total_size 2064, tape length 26214400 mark 1000
  delay: Total size now 2064.

PROMOTING DUMPS IF NEEDED, total_lev0 0, balanced_size 0...
planner: time 30.956: analysis took 0.000 secs

GENERATING SCHEDULE:
--------
ENDFLUSH
--------
driver: adding holding disk 0 dir /spool2/amanda size 37089049
reserving 14835619 out of 37089049 for degraded-mode dumps
driver: start time 30.972 inparallel 5 bandwidth 240 diskspace 37089049 dir OBSOLETE datestamp 20050830 driver: drain-ends tapeq LFFO big-dumpers ttt driver: result time 30.972 from taper: TAPE-ERROR [tape_rdlabel: tape open: /dev/nsa0: Device not configured]
dump of driver schedule before start degraded mode:
--------
--------
driver: state time 30.972 free kps: 240 space: 37089049 taper: DOWN idle-dumpers: 5 qlen tapeq: 0 runq: 0 roomq: 0 wakeup: 86400 driver-idle: not-idle
driver: interface-state time 30.972 if : free 120 if EXTERNAL: free 120
driver: hdisk-state time 30.972 hdisk 0: free 37089049 dumpers 0
driver: QUITTING time 30.972 telling children to quit
driver: send-cmd time 30.972 to dumper0: QUIT
driver: send-cmd time 30.972 to dumper1: QUIT
driver: send-cmd time 30.972 to dumper2: QUIT
driver: send-cmd time 30.973 to dumper3: QUIT
driver: send-cmd time 30.973 to dumper4: QUIT
driver: send-cmd time 30.973 to taper: QUIT
taper: DONE [idle wait: 30.906 secs]
driver: FINISHED time 30.975
amdump: end at Tue Aug 30 19:46:54 EDT 2005

tcpdump on both machines recording all traffic between them. Note that we see a number of packets at the end leave devel2, h13 sees them, but does not respond. Again, the firewall is disabled in this test.

The devel2 (server) view:

19:46:23.967162 devel2.937 > h13.blah.com.amanda: udp 117
19:46:24.074337 h13.blah.com.amanda > devel2.937: udp 50
19:46:24.249414 h13.blah.com.amanda > devel2.937: udp 81
19:46:24.249497 devel2.937 > h13.blah.com.amanda: udp 50
19:46:24.489787 devel2.937 > h13.blah.com.amanda: udp 1465
19:46:34.497794 devel2.937 > h13.blah.com.amanda: udp 1465
19:46:44.508815 devel2.937 > h13.blah.com.amanda: udp 1465

The h13 (client) view:

19:46:23.982760 devel2.937 > h13.blah.com.amanda: udp 117
19:46:24.054390 h13.blah.com.amanda > devel2.937: udp 50
19:46:24.230317 h13.blah.com.amanda > devel2.937: udp 81
19:46:24.264200 devel2.937 > h13.blah.com.amanda: udp 50
19:46:24.523791 devel2.937 > h13.blah.com.amanda: udp 1465 (frag 59731:1472@0+) 19:46:34.531821 devel2.937 > h13.blah.com.amanda: udp 1465 (frag 62763:1472@0+) 19:46:44.542471 devel2.937 > h13.blah.com.amanda: udp 1465 (frag 9535:1472@0+)

Any help with this is appreciated, I'm totally stumped.

Thanks,

Charles

<Prev in Thread] Current Thread [Next in Thread>