>
> On Tue, December 14, 2010 11:48 am, Robert Wirth wrote:
> > Hi,
> >
> > strange problem. Here's some hardware where Bacula has been running
> > successfully for about 5 years. It was release 1.38.11 under Solaris 10 x86.
> >
> > Last month, we had a system disk crash on the backup system. No backup
> > data was lost. We just had to reinstall the backup system.
> > Since this was our only Solaris x86 system, we decided to migrate
> > to Linux and to a newer Bacula release. Until the repaired hardware
> > was available, we ran a virtualized interim system, just for the
> > daily incremental backups to disk volumes.
> >
> > Since most of our current systems are Ubuntu Hardy Server LTS, we
> > chose Bacula 2.2.8 from this distribution as our new version (well,
> > it's old, but 1.38.11 was running well, and 2.2.8 was the default).
> >
> > We upgraded Bacula's MySQL database with the corresponding script
> > from 1.38.11 to 2.2.8. We imported the updated DB using mysqldump
> > into the new system, which runs MySQL 5.1.41 and Linux kernel 2.6.32.
> > The virtualized system worked well all the time.
> >
> > Now, the hardware version of the system is ready, and a yearly full
> > backup, which goes directly to tape, is imminent.
> >
> > And now, the strange things are coming...
> >
> >
> > /* The system is a 2x2-core AMD Opteron system, 4 GB RAM, 6x LSI SCSI U320
> > MegaRAID with separate channels for external disks, tape drives and
> > autochanger. 23 TB disk storage on external RAIDs, autochanger and
> > HP drives for LTO-3 tapes. System: see above. */
> >
> >
> > NOW BACKING UP...
> >
> > Starting a bunch of full backup jobs which fit onto 1 SINGLE TAPE
> > produces NO PROBLEMS: the jobs start, run, write, and terminate
> > within the usual span of time. This way, I can back up a dozen
> > systems with a total of 360 GB to one tape in a few hours.
> >
> >
> > FACING THE PROBLEM...
> >
> > Starting a bunch of full backup jobs that DO NOT FIT onto 1 single
> > tape proceeds as follows (with a fresh tape forced by setting the
> > previous one to read-only):
> >
> > - first, the jobs run well and write their data to the first fresh tape
> > of the corresponding pool. Speed is similar to what we knew from the old OS.
> >
> > - when the tape is full with around 600 GB of data, it is marked
> > Full and unloaded, and the next free tape of the pool is loaded.
> >
> > - from this moment on, writing to the new fresh tape becomes incredibly
> > slow (4 GB/hour) and mysqld constantly shows 95%-100% CPU load.
> > No other process has a significant load, and the mysqld load isn't
> > reflected in the system's overall load values:
> >
> > Cpu(s): 3.3%us, 2.2%sy, 0.0%ni, 91.6%id, 2.1%wa, 0.1%hi, 0.7%si,
> > 0.0%st
> > Mem: 3961616k total, 3850072k used, 111544k free, 17532k buffers
> > Swap: 3906552k total, 0k used, 3906552k free, 3579956k cached
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 1356 mysql 20 0 144m 31m 2376 S 98 0.8 163:57.79 mysqld
> > 1 root 20 0 2620 948 528 S 0 0.0 0:00.63 init
> > 2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
> > ....
> >
> > The only further effect I can see is that the table "bacula.JobMedia" is
> > growing. No errors in the system log, no MySQL errors, and none in Bacula's log.
> >
> > What I mainly don't understand is why this happens after a tape change.
> > The MaxSpoolSize is 32 GB, and I'm backing up 7 systems. Each of them
> > went through several spool/despool cycles while the first tape was in use.
> >
> > From the point of view of Bacula and its program logic, what has changed
> > when the tape has been changed? I guess it's all the same: spool data,
> > write it to tape and update the catalog, regardless of first, second
> > or later tape...?!?
>
> What do you see under Running Jobs in the 'status dir' output before and
> after the first tape has filled?
>
> If you have only the 'after' just now, that might be interesting.
I will try to capture this on the next attempt.
/* Current state: the second tape has been loaded and in use since yesterday
02:14 p.m., and only 29 GB have been written to it :-((( */
And I was wrong earlier: there IS something strange in the Bacula log. Oooh.
Yesterday, around 9:45 a.m., I started the 7 big backup jobs that don't
fit on one tape. The first tape was filled with 540 GB by 02:45 p.m.:
| 833 | 000090L1 | Full | 1 | 543,602,949,120 | 544 | 63,072,000 | 0 | 1 | 1 | LTO-3 | 2010-12-14 14:45:52 |
During this time, I could follow the spooling via messages in bconsole.
It all looked correct: parallel spooling and writing of data.
/* PARENTHESIS:
Current state: the second tape was loaded yesterday at 02:45 p.m.,
and only 29 GB have been written to it in the 18 hours since then :-(((
| 834 | 000098L1 | Append | 1 | 29,998,080,000 | 30 | 63,072,000 | 0 | 4 | 1 | LTO-3 | 2010-12-15 08:14:46 |
END OF PARENTHESIS */
The log contains information only about the job that I started
first (lnv-102). Not a word about the other 6 jobs and their spooling!
And the log breaks off at 01:22 p.m., more than one hour before
the tape was full and changed:
14-Dec 09:44 bup-serv-dir JobId 93518: Start Backup JobId 93518,
Job=lnv-102.2010-12-14_09.44.00
14-Dec 09:46 bup-serv-dir JobId 93518: Using Device "Tape1Left"
14-Dec 09:47 bup-serv-sd JobId 93518: Spooling data ...
14-Dec 10:21 bup-serv-sd JobId 93518: User specified spool size reached.
14-Dec 10:21 bup-serv-sd JobId 93518: Writing spooled data to Volume.
Despooling 32,000,031,721 bytes ...
14-Dec 11:24 bup-serv-sd JobId 93518: Despooling elapsed time = 00:17:28,
Transfer rate = 30.53 M bytes/second
14-Dec 11:24 bup-serv-sd JobId 93518: Spooling data again ...
14-Dec 11:43 bup-serv-sd JobId 93518: User specified spool size reached.
14-Dec 11:43 bup-serv-sd JobId 93518: Writing spooled data to Volume.
Despooling 32,000,032,040 bytes ...
14-Dec 12:58 bup-serv-sd JobId 93518: Despooling elapsed time = 00:14:45,
Transfer rate = 36.15 M bytes/second
14-Dec 12:58 bup-serv-sd JobId 93518: Spooling data again ...
14-Dec 13:22 bup-serv-sd JobId 93518: User specified spool size
[END OF FILE]
During this time, 17 spool sessions of 32 GB each, from all 7 systems,
were written to tape, not just the 2 despool runs of one job logged here.
"status dir" tells me that this one job has terminated, while the other
6 jobs are still marked as running:
93518 Full lnv-102.2010-12-14_09.44.00 has terminated
Nevertheless, the last spool file (25 GB) of this job is still
present in the spool directory:
-rw-r----- 1 root bacula 25590119686 2010-12-14 14:48
bup-serv-sd.data.93518.lnv-102.2010-12-14_09.44.00.Tape1Left.spool
And the attribute (metadata) spool file is just over 100 MB:
-rw-r----- 1 root bacula 103063552 2010-12-14 14:48
bup-serv-sd.attr.lnv-102.2010-12-14_09.44.00.5.spool
bacula-sd holds these two files open, as lsof shows:
bacula-sd 2150 root 13u REG 251,5 25590119686 13
/var/services/bacula/spool1/bup-serv-sd.data.93518.lnv-102.2010-12-14_09.44.00.Tape1Left.spool
bacula-sd 2150 root 14u REG 251,1 103063552 131160
/var/services/bacula/work/bup-serv-sd.attr.lnv-102.2010-12-14_09.44.00.5.spool
but there is no longer a connection to the FD of that system. This is
OK, since all of its data has been spooled.
For the 6 other jobs, both connections AND spool files are present. Correct.
Spooling for lnv-102 has finished, but its last spool file hasn't been
written to tape yet. So there should be NO HEAVY MySQL operations yet,
since no job has been completely despooled. Right?
Where does the MySQL load come from? Remember that the bunch of
tape-fitting jobs wrote 360 GB in a few hours, including all DB operations.
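To see what mysqld is actually doing while the writes crawl, one could sample
the running statements and watch the JobMedia row counts per job. This is only
a diagnostic sketch; it assumes the catalog database is named "bacula" and
that the mysql client can connect as root (adjust credentials as needed):

```shell
# Sample the SQL statements mysqld is executing right now;
# repeat a few times while the tape writing is slow:
mysql -u root -p -e 'SHOW FULL PROCESSLIST\G'

# Count JobMedia rows per job to see whose catalog records are
# growing (JobId and JobMediaId are standard Bacula schema columns):
mysql -u root -p bacula -e \
  'SELECT JobId, COUNT(*) AS n FROM JobMedia GROUP BY JobId ORDER BY JobId DESC LIMIT 10;'
```

If PROCESSLIST repeatedly shows the same SELECT or INSERT on JobMedia, an
EXPLAIN of that statement would show whether it is running without an index.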
Next, I will stop the slow processing and purge all the bad data.
Then I will start just one backup job that doesn't fit on one tape
and watch what happens. I will try to capture the state before
and after the tape change.
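For that next attempt, the "before" and "after" snapshots could be captured
non-interactively, since bconsole accepts commands on stdin. A rough sketch
(the log path and interval are arbitrary; storing the MySQL password in a
script is only acceptable for a short-lived test):

```shell
# Snapshot director status and MySQL activity every 5 minutes,
# so the moment of the tape change is covered in the trace:
while true; do
  date >> /tmp/bup-trace.log
  echo "status dir" | bconsole >> /tmp/bup-trace.log 2>&1
  mysql -u root -pSECRET -e 'SHOW FULL PROCESSLIST' >> /tmp/bup-trace.log
  sleep 300
done
```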
Regards,
Robert
----------------------------------------------------------------
+++ German Research Center for Artificial Intelligence +++
----------------------------------------------------------------
Dipl.-Inform. Robert V. Wirth, Campus D3_2, D-66123 Saarbruecken
@office: +49-681-85775-5078 / -5572 +++ @fax: +49-681-85775-5020
mailto:Robert.Wirth AT dfki DOT de ++++++++++ http://www.dfki.de/~wirth
----------------------------------------------------------------
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users