>>>>> On Wed, 15 Dec 2010 10:15:09 +0100, Robert Wirth said:
>
> >
> > On Tue, December 14, 2010 11:48 am, Robert Wirth wrote:
> > > Hi,
> > >
> > > A strange problem. Here's some hardware where Bacula had been running
> > > successfully for about 5 years. It was release 1.38.11 under Solaris 10 x86.
> > >
> > > Last month, we had a system disk crash on the backup system. No backup
> > > data was lost; we just had to reinstall the backup server itself.
> > > Since this was our only Solaris x86 system, we decided to migrate
> > > to Linux and to a newer Bacula release. Until the repaired hardware
> > > was available, we ran a virtualized interim system, just for the
> > > daily incremental backups to disk volumes.
> > >
> > > Since most of our current systems are Ubuntu Hardy Server LTS, we
> > > chose that distribution's Bacula 2.2.8 as our new version (well,
> > > it's old, but 1.38.11 was running well, and 2.2.8 was the default).
> > >
> > > We upgraded Bacula's MySQL database from 1.38.11 to 2.2.8 with the
> > > corresponding script and imported the updated DB via mysqldump into
> > > the new system, which runs MySQL 5.1.41 and Linux kernel 2.6.32.
> > > The virtualized system worked well the whole time.
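Just to rule out a half-finished schema conversion: a 1.38 -> 2.2 catalog move
usually looks roughly like the sketch below. Script and path names are from
memory (Ubuntu puts the scripts under /usr/share/bacula-director or similar),
so treat this only as a checklist, not as what you necessarily typed:

    # on the old box: dump the 1.38.11 catalog
    mysqldump -u bacula -p bacula > bacula-1.38.sql

    # on the new box: create an empty catalog and load the dump
    mysql -u root -p -e "CREATE DATABASE bacula"
    mysql -u bacula -p bacula < bacula-1.38.sql

    # then bring the schema up to the 2.2.8 level with the shipped script
    /usr/share/bacula-director/update_mysql_tables

If any step was skipped, the table layout (and in particular the indexes,
see below) may not be what 2.2.8 expects.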
> > >
> > > Now, the hardware version of the system is ready, and a yearly full
> > > backup, which goes directly to tape, is imminent.
> > >
> > > And now, the strange things are coming...
> > >
> > >
> > > /* The system is a 2x2-core AMD Opteron system, 4 GB RAM, 6x LSI SCSI U320
> > > MegaRAID with separate channels for external disks, tape drives and
> > > autochanger. 23 TB disk storage on external RAIDs, autochanger and
> > > HP drives for LTO-3 tapes. System: see above. */
> > >
> > >
> > > NOW BACKING UP...
> > >
> > > Starting a bunch of full backup jobs which fit onto 1 SINGLE TAPE
> > > produces NO PROBLEMS: the jobs start, run, write, and terminate
> > > within the usual span of time. That way, I can back up a dozen
> > > systems totalling 360 GB onto one tape in a few hours.
> > >
> > >
> > > FACING THE PROBLEM...
> > >
> > > Starting a bunch of full backup jobs that DO NOT FIT onto 1 single
> > > tape proceeds as follows (a fresh tape is forced by setting the
> > > former one to read-only):
> > >
> > > - first, the jobs run well and write their data to the first fresh tape
> > > of the corresponding pool. Speed is similar to what we saw on the old OS.
> > >
> > > - when the tape is full with around 600 GB of data, it is marked as
> > > Full and unloaded, and the next free tape of the pool is loaded.
> > >
> > > - from this moment on, writing to the new fresh tape becomes incredibly
> > > slow (4 GB/hour) and mysqld sits constantly at 95%-100% CPU load.
> > > No other process has any significant load, and the mysql load doesn't
> > > even show up in the system-wide CPU figures:
> > >
> > > Cpu(s): 3.3%us, 2.2%sy, 0.0%ni, 91.6%id, 2.1%wa, 0.1%hi, 0.7%si, 0.0%st
> > > Mem: 3961616k total, 3850072k used, 111544k free, 17532k buffers
> > > Swap: 3906552k total, 0k used, 3906552k free, 3579956k cached
> > >
> > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > > 1356 mysql 20 0 144m 31m 2376 S 98 0.8 163:57.79 mysqld
> > > 1 root 20 0 2620 948 528 S 0 0.0 0:00.63 init
> > > 2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
> > > ....
> > >
> > > The only further effect I can see is that the table "bacula.JobMedia" is
> > > growing. No errors in the system log, no MySQL errors, none in Bacula's log.
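That JobMedia growth plus the constant mysqld load is exactly where I would
start digging. While mysqld is pegged, it would be interesting to see which
statement it is actually executing, and whether the indexes survived the
dump/import from the old catalog. Something like this (credentials and
database name are placeholders, adjust to your setup):

    # what is the busy mysqld thread doing right now?
    mysql -u bacula -p bacula -e "SHOW FULL PROCESSLIST\G"

    # how large is JobMedia, and are its indexes still there?
    mysql -u bacula -p bacula -e "SELECT COUNT(*) FROM JobMedia"
    mysql -u bacula -p bacula -e "SHOW INDEX FROM JobMedia"
    mysql -u bacula -p bacula -e "SHOW INDEX FROM File"

A catalog restored from a dump with missing or dropped indexes will happily
keep working, just painfully slowly once the tables get large.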
> > >
> > > What I mainly don't understand is why this happens after a tape change.
> > > The MaxSpoolSize is 32 GB, and I'm backing up 7 systems. Each of them
> > > went through several spool/despool cycles while writing to the first tape.
> > >
> > > From the point of view of Bacula and its program logic, what has changed
> > > once the tape has been changed? I guess it's all the same: spool the data,
> > > write it to tape and update the catalog, regardless of whether it is the
> > > first, second or a later tape...?!?
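Nothing in the spooling logic should depend on which tape of the sequence is
mounted, as far as I can tell, so it is worth double-checking that the spool
directives on the new install really are what you expect. Roughly (conf file
paths as in the Ubuntu packages, adjust if yours differ):

    grep -i -B4 "Maximum Spool Size" /etc/bacula/bacula-sd.conf
    grep -i "Spool Data" /etc/bacula/bacula-dir.conf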
> >
> > What do you see under Running Jobs in the 'status dir' output before and
> > after the first tape has filled?
> >
> > If you have only the 'after' just now, that might be interesting.
>
> I'll try to capture this on the next run.
>
> /* Current state: the second tape has been loaded and in use since
> 02:45 p.m. yesterday, and only 29 GB have actually been written to it :-((( */
>
>
> And I was wrong: there IS something strange in the Bacula log. Oooh.
>
> Yesterday, I started the 7 big (not-fitting) backup jobs around 9:45 a.m.
> The first tape was filled with 540 GB by 02:45 p.m.:
>
> | 833 | 000090L1 | Full | 1 | 543,602,949,120 | 544 | 63,072,000 | 0 | 1 | 1 | LTO-3 | 2010-12-14 14:45:52 |
>
> During this time, I could follow the spooling via messages on bconsole.
> It all looked correct: parallel spooling and writing of data.
>
> /* PARENTHESIS:
> Current state: the second tape was loaded yesterday at 02:45 p.m.,
> and only 29 GB have actually been written to it in the 18 hours since then :-(((
> | 834 | 000098L1 | Append | 1 | 29,998,080,000 | 30 | 63,072,000 | 0 | 4 | 1 | LTO-3 | 2010-12-15 08:14:46 |
> END OF PARENTHESIS */
>
>
> In the log, there is only information about the job which I started
> first (lnv-102). Not a word about the other 6 jobs and their spooling!
> And the log file stops at 01:22 p.m., more than an hour before
> the tape was full and changed:
>
> 14-Dec 09:44 bup-serv-dir JobId 93518: Start Backup JobId 93518,
> Job=lnv-102.2010-12-14_09.44.00
> 14-Dec 09:46 bup-serv-dir JobId 93518: Using Device "Tape1Left"
> 14-Dec 09:47 bup-serv-sd JobId 93518: Spooling data ...
> 14-Dec 10:21 bup-serv-sd JobId 93518: User specified spool size reached.
> 14-Dec 10:21 bup-serv-sd JobId 93518: Writing spooled data to Volume.
> Despooling 32,000,031,721 bytes ...
> 14-Dec 11:24 bup-serv-sd JobId 93518: Despooling elapsed time = 00:17:28,
> Transfer rate = 30.53 M bytes/second
> 14-Dec 11:24 bup-serv-sd JobId 93518: Spooling data again ...
> 14-Dec 11:43 bup-serv-sd JobId 93518: User specified spool size reached.
> 14-Dec 11:43 bup-serv-sd JobId 93518: Writing spooled data to Volume.
> Despooling 32,000,032,040 bytes ...
> 14-Dec 12:58 bup-serv-sd JobId 93518: Despooling elapsed time = 00:14:45,
> Transfer rate = 36.15 M bytes/second
> 14-Dec 12:58 bup-serv-sd JobId 93518: Spooling data again ...
> 14-Dec 13:22 bup-serv-sd JobId 93518: User specified spool size
> [END OF FILE]
>
> During this time, 17 spool sessions from all 7 systems, each 32 GB, were
> written to tape, not just the 2 despools of one system that are logged here.
It looks like the log stream isn't being flushed to the file. Can you try the
bconsole messages command instead?
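For example (the bconsole.conf path is the one the Ubuntu packages use,
adjust if needed):

    printf "messages\nquit\n" | bconsole -c /etc/bacula/bconsole.conf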
> "status dir" tells me that this one job is terminated, while the other
> 6 jobs are still marked as running:
>
> 93518 Full lnv-102.2010-12-14_09.44.00 has terminated
>
> Nevertheless, the last spool file (25 GB) of this job is still
> present in the spool directory:
>
> -rw-r----- 1 root bacula 25590119686 2010-12-14 14:48
> bup-serv-sd.data.93518.lnv-102.2010-12-14_09.44.00.Tape1Left.spool
>
> And the attribute spool file is just about 100 MB:
>
> -rw-r----- 1 root bacula 103063552 2010-12-14 14:48
> bup-serv-sd.attr.lnv-102.2010-12-14_09.44.00.5.spool
>
> bacula-sd holds these two files open, as lsof shows:
>
> bacula-sd 2150 root 13u REG 251,5 25590119686 13
> /var/services/bacula/spool1/bup-serv-sd.data.93518.lnv-102.2010-12-14_09.44.00.Tape1Left.spool
> bacula-sd 2150 root 14u REG 251,1 103063552 131160
> /var/services/bacula/work/bup-serv-sd.attr.lnv-102.2010-12-14_09.44.00.5.spool
>
> but there is no longer a connection to the FD of that system. This is OK,
> since all of its data has been spooled.
> For the 6 other jobs, connections AND spool files are present. Correct.
>
>
> Spooling of lnv-102 has finished, but its last spool hasn't been written
> to tape yet. So there should be NO HEAVY mysql operations yet, since no
> job has been completely despooled yet. Right?
>
> Where does the mysql load come from? Remember that the bunch of tape-fitting
> jobs wrote 360 GB in a few hours, including the DB operations.
Can you turn on mysql query logging?
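With MySQL 5.1 that can be switched on at runtime; roughly like this (the log
path is just an example, and the general log grows fast, so switch it off
again afterwards):

    # log every statement
    mysql -u root -p -e "SET GLOBAL general_log_file='/var/log/mysql/bacula-query.log'; SET GLOBAL general_log='ON'"

    # or, less noisy: only statements slower than 1 second
    mysql -u root -p -e "SET GLOBAL slow_query_log='ON'; SET GLOBAL long_query_time=1"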
__Martin