Subject: [ADSM-L] GPFS file system backup problem
From: Michael Green <mishagreen AT GMAIL DOT COM>
To: ADSM-L AT VM.MARIST DOT EDU
Date: Tue, 29 Jan 2008 12:57:30 +0200
Bright minds,

Some time ago, a problem arose with one of the GPFS file systems
that I back up.
The system I'm talking about is:
Server: IBM x345 with SLES 9 SP3
FC switch: Cisco MDS9124
Storage: DS4400 (formerly FAStT700 ), latest firmware
GPFS ver: 2.3
Multipathing: IBM supplied RDAC (Linux MPP Driver Version: 09.01.B5.76)
TSM client: 5.4.1.2; server: 5.3.6
The filesystem:

bioinfo4:~ # df -h /srv
Filesystem            Size  Used Avail Use% Mounted on
/dev/gpfs1            452G  377G   76G  84% /srv
bioinfo4:~ # mmlsfs /dev/gpfs1
flag value          description
---- -------------- -----------------------------------------------------
 -s  roundRobin     Stripe method
 -f  2048           Minimum fragment size in bytes
 -i  512            Inode size in bytes
 -I  8192           Indirect block size in bytes
 -m  1              Default number of metadata replicas
 -M  1              Maximum number of metadata replicas
 -r  1              Default number of data replicas
 -R  1              Maximum number of data replicas
 -j  cluster        Block allocation type
 -D  posix          File locking semantics in effect
 -k  posix          ACL semantics in effect
 -a  1048576        Estimated average file size
 -n  32             Estimated number of nodes that will mount file system
 -B  65536          Block size
 -Q  user;group     Quotas enforced
     user;group     Default quotas enabled
 -F  6999936        Maximum number of inodes
 -V  8.01           File system version. Highest supported version: 8.02
 -u  yes            Support for large LUNs?
 -z  no             Is DMAPI enabled?
 -E  yes            Exact mtime mount option
 -S  no             Suppress atime mount option
 -d  gpfs4nsd  Disks in file system
 -A  yes            Automatic mount option
 -o  none           Additional mount options
 -T  /srv           Default mount point


Basically, what happens is that the backup of that particular file
system never completes; it is cut short with return code 12.
I have two GPFS file systems on that linux box, both reside on the
same storage and are identically connected in terms of storage, FC
topology and multipathing.
One backs up without a hitch, while the other doesn't. The log
excerpt below illustrates what's going on.
An online GPFS fsck (mmfsck /dev/gpfs1 -o) returns no errors. I
haven't tried an offline fsck.

Any ideas on how to proceed with this problem will be appreciated!
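For context, this is how I read the scheduler's return code (a minimal sketch; the actual dsmc call is shown only as a comment, and rc 12, which is what the scheduler reports, is passed in directly):

```shell
# Map a TSM 5.x client return code to its documented meaning.
tsm_rc_meaning() {
  case "$1" in
    0)  echo "completed successfully" ;;
    4)  echo "completed, but some files were skipped" ;;
    8)  echo "completed with warning messages" ;;
    12) echo "severe error: the operation failed" ;;
    *)  echo "unexpected return code: $1" ;;
  esac
}

# On the node, a manual re-run of the failing volume would be:
#   dsmc incremental /srv; tsm_rc_meaning $?
tsm_rc_meaning 12
```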



01/28/08   21:00:12 Scheduler has been started by Dsmcad.
01/28/08   21:00:12 Querying server for next scheduled event.
01/28/08   21:00:12 Node Name: BIOINFO4
01/28/08   21:00:12 Session established with server GALAHAD: Linux/i386
01/28/08   21:00:12   Server Version 5, Release 3, Level 6.0
01/28/08   21:00:12   Server date/time: 01/28/08   21:00:12  Last
access: 01/28/08   20:26:46

01/28/08   21:00:12 --- SCHEDULEREC QUERY BEGIN
01/28/08   21:00:12 --- SCHEDULEREC QUERY END
01/28/08   21:00:12 Next operation scheduled:
01/28/08   21:00:12 ------------------------------------------------------------
01/28/08   21:00:12 Schedule Name:         21_SCHED_18
01/28/08   21:00:12 Action:                Incremental
01/28/08   21:00:12 Objects:
01/28/08   21:00:12 Options:
01/28/08   21:00:12 Server Window Start:   21:00:00 on 01/28/08
01/28/08   21:00:12 ------------------------------------------------------------
01/28/08   21:00:12
Executing scheduled command now.
01/28/08   21:00:12 --- SCHEDULEREC OBJECT BEGIN 21_SCHED_18 01/28/08   21:00:00
01/28/08   21:00:12 Incremental backup of volume '/'
01/28/08   21:00:12 Incremental backup of volume '/boot'
01/28/08   21:00:12 Incremental backup of volume '/csminstall'
01/28/08   21:00:12 Incremental backup of volume '/home'
01/28/08   21:00:12 Incremental backup of volume '/srv'
<snip>
01/28/08   21:07:51 Successful incremental backup of '/boot'
<snip>
01/28/08   21:08:05 Successful incremental backup of '/'
<snip>
01/28/08   21:09:53 Successful incremental backup of '/csminstall'
<snip>
01/28/08   23:59:45 ANS1802E Incremental backup of '/home' finished
with 1 failure
<snip>
01/29/08   00:00:01 Normal File-->            59,008 /srv/group.quota
[Sent]
01/29/08   00:00:01 Normal File-->           262,144 /srv/user.quota
[Sent]
01/29/08   00:00:01 Normal File-->             8,109
/srv/LogShared/apache2/access_log [Sent]
<snip>
01/29/08   02:28:31 Normal File-->         1,268,946
/srv/databases/unigeneU/Hs.lib.info [Sent]
01/29/08   02:28:46 Normal File-->       221,453,209
/srv/databases/unigeneU/Hs.profiles [Sent]
01/29/08   02:29:04 Normal File-->       694,651,680
/srv/databases/unigeneU/Hs.data [Sent]
01/29/08   02:29:28 Normal File-->       684,135,874
/srv/databases/unigeneU/Hs.retired.lst [Sent]
01/29/08   02:29:28 ANS1999E Incremental processing of '/srv' stopped.
01/29/08   02:29:28 --- SCHEDULEREC STATUS BEGIN
01/29/08   02:29:28 Total number of objects inspected: 3,039,708
01/29/08   02:29:28 Total number of objects backed up:  559,287
01/29/08   02:29:28 Total number of objects updated:          1
01/29/08   02:29:28 Total number of objects rebound:          0
01/29/08   02:29:28 Total number of objects deleted:          0
01/29/08   02:29:28 Total number of objects expired:         95
01/29/08   02:29:28 Total number of objects failed:           1
01/29/08   02:29:28 Total number of bytes transferred:    70.16 GB
01/29/08   02:29:28 Data transfer time:                6,053.40 sec
01/29/08   02:29:28 Network data transfer rate:        12,153.50 KB/sec
01/29/08   02:29:28 Aggregate data transfer rate:      3,723.94 KB/sec
01/29/08   02:29:28 Objects compressed by:                    0%
01/29/08   02:29:28 Elapsed processing time:           05:29:15
01/29/08   02:29:28 --- SCHEDULEREC STATUS END
01/29/08   02:29:28 ANS1028S An internal program error occurred.
01/29/08   02:29:28 --- SCHEDULEREC OBJECT END 21_SCHED_18 01/28/08   21:00:00
01/29/08   02:29:28 ANS1512E Scheduled event '21_SCHED_18' failed.
Return code = 12.
01/29/08   02:29:28 Sending results for scheduled event '21_SCHED_18'.
01/29/08   02:29:29 Results sent to server for scheduled event '21_SCHED_18'.
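In case it helps, this is how I pull the error (E) and severe (S) ANS messages out of the schedule log (the real log path is whatever SCHEDLOGNAME points at; a couple of sample lines are inlined here so the sketch is self-contained):

```shell
# Sketch: filter ANSnnnnE / ANSnnnnS messages from a TSM schedule log.
# Sample lines are inlined here; point grep at the real dsmsched.log instead.
log=$(mktemp)
printf '%s\n' \
  '01/29/08   02:29:28 ANS1999E Incremental processing of /srv stopped.' \
  '01/29/08   02:29:28 ANS1028S An internal program error occurred.' \
  '01/29/08   02:29:28 --- SCHEDULEREC STATUS END' > "$log"
grep -E 'ANS[0-9]{4}[ES] ' "$log"
rm -f "$log"
```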

--
Warm regards,
Michael Green