Bacula-users

[Bacula-users] Disk volume management - catalog<>disk discrepencies

2010-05-18 00:42:57
Subject: [Bacula-users] Disk volume management - catalog<>disk discrepencies
From: Craig Ringer <craig AT postnewspapers.com DOT au>
To: bacula-users <bacula-users AT lists.sourceforge DOT net>
Date: Tue, 18 May 2010 12:38:52 +0800
Hi folks

I've been having a lot of problems with Bacula's disk volume management
over time. Most issues seem to stem from cases where the catalog gets
out of sync with the file system, like:

- Volume inserted into catalog, creation of file on disk fails, volume
  remains in catalog

- Write to volume fails, resulting in volume that's shorter on disk than
  the catalog thinks it should be because the catalog is updated with
  the *expected* size assuming the write is successful. Further backups
  try to use this volume and fail. Restores from this volume fail. ARgh!

The latest issue is another catalog<>disk discrepency. At some point
Bacula seems to have failed to create some auto-labeled volumes, but has
still inserted them into the catalog. This might've been due to a
transient disk-full situation, but it's truly hard to know.

The catalog shows some volumes that don't actually exist on disk:


> +---------+------------------------+-----------+---------+---------------+----------+--------------+---------+------+-----------+--------------------+---------------------+
> | mediaid | volumename             | volstatus | enabled | volbytes      | 
> volfiles | volretention | recycle | slot | inchanger | mediatype          | 
> lastwritten         |
> +---------+------------------------+-----------+---------+---------------+----------+--------------+---------+------+-----------+--------------------+---------------------+
> |     158 | HotProductionIncr-0158 | Recycle   |       1 |             1 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-04-21 17:46:51 |
> |     165 | HotProductionIncr-0165 | Used      |       1 | 2,129,594,337 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-04-22 14:00:22 |
> |     172 | HotProductionIncr-0172 | Used      |       1 |   486,771,914 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-04-23 13:00:31 |
> |     178 | HotProductionIncr-0178 | Used      |       1 |     2,698,541 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-04-24 13:00:19 |
> |     186 | HotProductionIncr-0186 | Used      |       1 |       622,259 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-04-24 23:05:16 |
> |     196 | HotProductionIncr-0196 | Used      |       1 |     1,240,182 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-04-26 23:05:09 |
> |     202 | HotProductionIncr-0202 | Used      |       1 | 1,634,621,778 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-04-28 13:00:21 |
> |     203 | HotProductionIncr-0203 | Used      |       1 | 3,037,665,985 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-04-29 14:00:23 |
> |     204 | HotProductionIncr-0204 | Used      |       1 |   695,723,567 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-04-30 13:00:55 |
> |     205 | HotProductionIncr-0205 | Used      |       1 | 8,473,730,759 |    
>     1 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-05-06 14:01:11 |
> |     206 | HotProductionIncr-0206 | Used      |       1 | 4,679,398,060 |    
>     1 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-05-07 13:00:55 |
> |     207 | HotProductionIncr-0207 | Used      |       1 |   107,414,349 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-05-08 23:05:05 |
> |     208 | HotProductionIncr-0208 | Used      |       1 |       519,455 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-05-08 23:05:48 |
> |     209 | HotProductionIncr-0209 | Used      |       1 |   711,016,317 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-05-10 23:06:23 |
> |     210 | HotProductionIncr-0210 | Used      |       1 | 1,723,979,638 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-05-12 10:01:34 |
> |     211 | HotProductionIncr-0211 | Used      |       1 | 5,953,774,081 |    
>     1 |    1,209,600 |       1 |    0 |         0 | File_HotProduction | 
> 2010-05-13 12:02:58 |
> |     212 | HotProductionIncr-0212 | Append    |       1 |             0 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction |      
>                |
> |     213 | HotProductionIncr-0213 | Append    |       1 |             0 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction |      
>                |
> |     214 | HotProductionIncr-0214 | Append    |       1 |             0 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction |      
>                |
> |     215 | HotProductionIncr-0215 | Append    |       1 |             0 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction |      
>                |
> |     216 | HotProductionIncr-0216 | Append    |       1 |             0 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction |      
>                |
> |     217 | HotProductionIncr-0217 | Append    |       1 |             0 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction |      
>                |
> |     218 | HotProductionIncr-0218 | Append    |       1 |             0 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction |      
>                |
> |     219 | HotProductionIncr-0219 | Append    |       1 |             0 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction |      
>                |
> |     220 | HotProductionIncr-0220 | Append    |       1 |             0 |    
>     0 |    1,209,600 |       1 |    0 |         0 | File_HotProduction |      
>                |
> +---------+------------------------+-----------+---------+---------------+----------+--------------+---------+------+-----------+--------------------+---------------------+


where any with "lastwritten" blank, volbytes 0, volfiles 0 are missing
in the file system:

> craig@backup:$ sudo ls -l /srv/bacula-main/HotProduction
> total 603766036
> -rw-r----- 1 bacula tape 168740858003 2010-04-20 17:37 HotProductionFull-0154
> -rw-r----- 1 bacula tape 166859921967 2010-04-26 03:13 HotProductionFull-0191
> -rw-r----- 1 bacula tape  42450993346 2010-05-05 19:18 HotProductionFull-0199
> -rw-r----- 1 bacula tape 206717382464 2010-05-10 03:38 HotProductionFull-0250
> -rw-r----- 1 bacula tape   3848435102 2010-04-21 17:46 HotProductionIncr-0158
> -rw-r----- 1 bacula tape   2129594337 2010-04-22 14:00 HotProductionIncr-0165
> -rw-r----- 1 bacula tape    486771914 2010-04-23 13:00 HotProductionIncr-0172
> -rw-r----- 1 bacula tape      2698541 2010-04-24 13:00 HotProductionIncr-0178
> -rw-r----- 1 bacula tape       622259 2010-04-24 23:05 HotProductionIncr-0186
> -rw-r----- 1 bacula tape      1240182 2010-04-26 23:05 HotProductionIncr-0196
> -rw-r----- 1 bacula tape   1634621778 2010-04-28 13:00 HotProductionIncr-0202
> -rw-r----- 1 bacula tape   3037665985 2010-04-29 14:00 HotProductionIncr-0203
> -rw-r----- 1 bacula tape    695723567 2010-04-30 13:00 HotProductionIncr-0204
> -rw-r----- 1 bacula tape   8473730759 2010-05-06 14:01 HotProductionIncr-0205
> -rw-r----- 1 bacula tape   4679398060 2010-05-07 13:00 HotProductionIncr-0206
> -rw-r----- 1 bacula tape    107414349 2010-05-08 23:05 HotProductionIncr-0207
> -rw-r----- 1 bacula tape       519455 2010-05-08 23:05 HotProductionIncr-0208
> -rw-r----- 1 bacula tape    711016317 2010-05-10 23:06 HotProductionIncr-0209
> -rw-r----- 1 bacula tape   1723979638 2010-05-12 10:01 HotProductionIncr-0210
> -rw-r----- 1 bacula tape   5953774081 2010-05-13 12:02 HotProductionIncr-0211


and when a job tries to use them, it seems to get stuck, with the sd
periodically reporting the error:

18-May 12:20 backup-sd JobId 854: Warning: mount.c:221 Open device
"FileStorage_HotProduction" (/srv/bacula-main/HotProduction) Volume
"HotProductionIncr-0212" failed: ERR=dev.c:549 Could not open:
/srv/bacula-main/HotProduction/HotProductionIncr-0212, ERR=No such file
or directory

leaving a huge queue of jobs that use the stame storage stuck behind it:


>    853 Increme  BucketActiveData.2010-05-13_13.00.00_05 is running
>    854 Increme  FlouryProductionAndPongrassData.2010-05-13_13.00.00_06 is 
> waiting for a mount request
>    855 Increme  PascalProductionAndPongrassData.2010-05-13_13.00.00_07 is 
> running
>    856 Increme  BucketActiveData.2010-05-13_14.00.00_08 is waiting on max 
> Client jobs
>    857 Increme  FlouryProductionAndPongrassData.2010-05-13_14.00.00_09 is 
> waiting on max Client jobs
>    858 Increme  PascalProductionAndPongrassData.2010-05-13_14.00.00_10 is 
> waiting on max Client jobs
>    861 Increme  BucketActiveData.2010-05-13_23.05.00_13 is waiting on max 
> Client jobs
>    862 Increme  FlouryProductionAndPongrassData.2010-05-13_23.05.00_14 is 
> waiting on max Client jobs
>    863 Increme  PascalProductionAndPongrassData.2010-05-13_23.05.00_15 is 
> waiting on max Client jobs
>    867 Increme  BucketArchivalData.2010-05-13_23.11.00_19 is waiting on max 
> Client jobs
>    871 Increme  BucketSystem.2010-05-13_23.15.00_23 is waiting on max Client 
> jobs
>    873 Increme  FlourySystem.2010-05-13_23.15.00_25 is waiting on max Client 
> jobs
>    875 Full    BackupCatalog.2010-05-13_23.20.00_27 is waiting for higher 
> priority jobs to finish
>    876 Increme  BucketActiveData.2010-05-14_13.00.00_28 is waiting on max 
> Client jobs
>    877 Increme  FlouryProductionAndPongrassData.2010-05-14_13.00.00_29 is 
> waiting on max Client jobs
>    878 Increme  PascalProductionAndPongrassData.2010-05-14_13.00.00_30 is 
> waiting on max Client jobs
>    881 Increme  BucketActiveData.2010-05-14_23.05.00_33 is waiting on max 
> Client jobs
>    882 Increme  FlouryProductionAndPongrassData.2010-05-14_23.05.00_34 is 
> waiting on max Client jobs
>    883 Increme  PascalProductionAndPongrassData.2010-05-14_23.05.00_35 is 
> waiting on max Client jobs
>    887 Increme  BucketArchivalData.2010-05-14_23.11.00_39 is waiting on max 
> Client jobs
>    891 Increme  BucketSystem.2010-05-14_23.15.00_43 is waiting on max Client 
> jobs
>    893 Increme  FlourySystem.2010-05-14_23.15.00_45 is waiting on max Client 
> jobs
>    895 Full    BackupCatalog.2010-05-14_23.20.00_47 is waiting execution
>    896 VolumeT  VerifyBucketArchivalData.2010-05-15_12.00.00_48 is waiting on 
> max Client jobs
>    898 Increme  BucketActiveData.2010-05-15_13.00.00_50 is waiting on max 
> Client jobs
>    899 Increme  FlouryProductionAndPongrassData.2010-05-15_13.00.00_51 is 
> waiting on max Client jobs
>    900 Increme  PascalProductionAndPongrassData.2010-05-15_13.00.00_52 is 
> waiting on max Client jobs
>    903 Increme  BucketActiveData.2010-05-15_23.05.00_55 is waiting on max 
> Client jobs
>    904 Increme  FlouryProductionAndPongrassData.2010-05-15_23.05.00_56 is 
> waiting on max Client jobs
>    905 Increme  PascalProductionAndPongrassData.2010-05-15_23.05.00_57 is 
> waiting on max Client jobs
>    909 Increme  BucketArchivalData.2010-05-15_23.11.00_01 is waiting on max 
> Client jobs
>    913 Increme  BucketSystem.2010-05-15_23.15.00_05 is waiting on max Client 
> jobs
>    915 Increme  FlourySystem.2010-05-15_23.15.00_07 is waiting on max Client 
> jobs
>    917 Full    BackupCatalog.2010-05-15_23.20.00_09 is waiting execution
>    918 Full    BucketActiveData.2010-05-16_23.05.00_11 is waiting on max 
> Client jobs
>    919 Full    FlouryProductionAndPongrassData.2010-05-16_23.05.00_12 is 
> waiting on max Client jobs
>    920 Full    PascalProductionAndPongrassData.2010-05-16_23.05.00_13 is 
> waiting on max Client jobs
>    927 Differe  BucketSystem.2010-05-16_23.15.00_20 is waiting on max Client 
> jobs
>    929 Differe  FlourySystem.2010-05-16_23.15.00_22 is waiting on max Client 
> jobs
>    931 Full    BackupCatalog.2010-05-16_23.20.00_24 is waiting execution
>    932 Increme  BucketActiveData.2010-05-17_13.00.00_25 is waiting on max 
> Client jobs
>    933 Increme  FlouryProductionAndPongrassData.2010-05-17_13.00.00_26 is 
> waiting on max Client jobs
>    934 Increme  PascalProductionAndPongrassData.2010-05-17_13.00.00_27 is 
> waiting on max Client jobs
>    937 Increme  BucketActiveData.2010-05-17_23.05.00_30 is waiting on max 
> Client jobs
>    938 Increme  FlouryProductionAndPongrassData.2010-05-17_23.05.00_31 is 
> waiting on max Client jobs
>    939 Increme  PascalProductionAndPongrassData.2010-05-17_23.05.00_32 is 
> waiting on max Client jobs
>    943 Increme  BucketArchivalData.2010-05-17_23.11.00_37 is waiting on max 
> Client jobs
>    947 Increme  BucketSystem.2010-05-17_23.15.00_41 is waiting on max Client 
> jobs
>    949 Increme  FlourySystem.2010-05-17_23.15.00_43 is waiting on max Client 
> jobs
>    951 Full    BackupCatalog.2010-05-17_23.20.00_45 is waiting execution




I'm not sure I understand the point of the SD's behaviour here.
Shouldn't it fail with an error if it can't find the volume? Or (IMO
even better) simply create it?

For that matter, shouldn't volume creation be done during an open
transaction with PostgreSQL, so that the volume record isn't actually
committed into the database until the file system operations to label it
have succeeded?

-- 
Craig Ringer

Tech-related writing: http://soapyfrogs.blogspot.com/

------------------------------------------------------------------------------

_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users

<Prev in Thread] Current Thread [Next in Thread>