Re: [Bacula-users] backup slowdown (mysqld) after tape autochange

> 
> On Tue, December 14, 2010 11:48 am, Robert Wirth wrote:
> > Hi,
> >
> > strange problem.  Here's some hardware where Bacula has been running
> > successfully for ca. 5 years.  It was release 1.38.11 under Solaris 10x86.
> >
> > Last month, we had a system disk crash on the backup system.  No backup
> > datas have been lost.  We just had to reinstall the backup system.
> > Since this was our only Solaris x86 system, we decided to migrate
> > to Linux and to a newer Bacula release.  Until the repaired hardware
> > was present, we started with a virtualized new system, just for the
> > daily incremental backups to disk volumes.
> >
> > Since most of our actual systems are Ubuntu Hardy server LTS, we
> > choosed Bacula 2.2.8 of this distribution as our new version (well,
> > it's old, but 1.38.11 was running well, and 2.2.8 was the default)
> >
> > We upgraded Bacula's mysql database with the corresponding script
> > from 1.38.11 to 2.2.8.  We imported the updated DB using mysql_dump
> > into the new system which has MySQL 5.1.41 and Linux Kernel 2.6.32
> > The virtualized system worked well all the time.
> >
> > Now, the hardware version of the system is ready, and a yearly full
> > backup, which goes directly to tape, is imminent.
> >
> > And now, the strange things are coming...
> >
> >
> > /* The system is a 2x2 core AMD Opteron system, 4 GB RAM, 6xLSI SCSI U320
> > Megaraid with seperated channels for external disks, tape readers and
> > autochanger.  23 TB disk storage on external RAIDs, autochanger and
> > HP-readers for LTO-3 tapes.   System: see above. */
> >
> >
> > NOW BACKING UP...
> >
> > Starting a bunch of full backup jobs which fit into 1 SINGLE TAPE
> > produces NO PROBLEMS:  the jobs start, run and write, and terminate
> > within a usual span of time.  In so doing, I can backup a dozen
> > systems with totally 360 GB on one tape in a few hours.
> >
> >
> > FACING THE PROBLEM...
> >
> > Starting a bunch of full backup jobs that DO NOT FIT into 1 single
> > tape proceeds like follows  (with a fresh tape forced by setting the
> > former one to readonly):
> >
> > - first, the jobs run well and write their data to the first fresh tape
> >   of the corresponding pool.  Speed is similar as known from the old OS.
> >
> > - when the tape is full with around 600GB of data, it is marked as
> >   Full, being unloaded, and the next free tape of the pool is loaded.
> >
> > - from this moment on, writing to the new fresh tape becomes incredibly
> >   slow (4 GB/hour) and mysqld has constantly 95%-100% CPU load.
> >   No other process has an important load, and the mysql load isn't
> >   represented in the system's load values:
> >
> > Cpu(s):  3.3%us,  2.2%sy,  0.0%ni, 91.6%id,  2.1%wa,  0.1%hi,  0.7%si,
> > 0.0%st
> > Mem:   3961616k total,  3850072k used,   111544k free,    17532k buffers
> > Swap:  3906552k total,        0k used,  3906552k free,  3579956k cached
> >
> >   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >  1356 mysql     20   0  144m  31m 2376 S   98  0.8 163:57.79 mysqld
> >     1 root      20   0  2620  948  528 S    0  0.0   0:00.63 init
> >     2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd
> >  ....
> >
> > The only further effect I can see is that the table "bacula.JobMedia" is
> > growing.   No errors in system log, no mysql errors, nor in Baculas log.
> >
> > What I mainly don't understand is why this happens after a tape change.
> > The MaxSpoolSize is 32GB, and I'm backing up 7 systems.  Each of them
> > had several spool steps during the first tape.
> >
> >>From the view of Bacula and its program logic, what has changed when
> > the tape has been changed?  I guess it's all the same:  spooling data,
> > writing them to tape and update the catalog, regardless of first, second
> > or later tape...?!?
> 
> What do you see under Running Jobs in the 'status dir' output before and
> after the first tape has filled?
> 
> If you have only the 'after' just now, that might be interesting.

Will try to see this on next try.

/* Actual state:  the second tape is loaded and used since yesterday 
02:14 p.m., and it really has been written 29 GB :-((( */


And, I wrote wrong.  There IS something strange in bacula log.  Oooh.

Yesterday, I started the 7 not-fitting big backup jobs around 9:45 a.m.  
The first tape has been filled with 540 GB until 02:45 p.m.  

|     833 | 000090L1   | Full      |       1 | 543,602,949,120 |      544 |   
63,072,000 |       0 |    1 |         1 | LTO-3     | 2010-12-14 14:45:52 |

During this time, I could follow the spooling via messages on bconsole. 
It all looked correct:  parallel spooling and writing of data.

/* PARENTHESIS: 
Actual state:  the second tape was loaded yesterday 02:45 p.m., 
and it really has been written 29 GB since then during 18 hours :-((( 
|     834 | 000098L1   | Append    |       1 |  29,998,080,000 |       30 |   
63,072,000 |       0 |    4 |         1 | LTO-3     | 2010-12-15 08:14:46 |
END OF PARENTHESIS */


In the log, there's only information about the job which I started
first (lnv-102).  No word about the other 6 jobs and their spooling!  
And the logfile is interrupted at 01:22 p.m., more than one hour before 
the tape was full and changed:

14-Dec 09:44 bup-serv-dir JobId 93518: Start Backup JobId 93518, 
Job=lnv-102.2010-12-14_09.44.00
14-Dec 09:46 bup-serv-dir JobId 93518: Using Device "Tape1Left"
14-Dec 09:47 bup-serv-sd JobId 93518: Spooling data ...
14-Dec 10:21 bup-serv-sd JobId 93518: User specified spool size reached.
14-Dec 10:21 bup-serv-sd JobId 93518: Writing spooled data to Volume. 
Despooling 32,000,031,721 bytes ...
14-Dec 11:24 bup-serv-sd JobId 93518: Despooling elapsed time = 00:17:28, 
Transfer rate = 30.53 M bytes/second
14-Dec 11:24 bup-serv-sd JobId 93518: Spooling data again ...
14-Dec 11:43 bup-serv-sd JobId 93518: User specified spool size reached.
14-Dec 11:43 bup-serv-sd JobId 93518: Writing spooled data to Volume. 
Despooling 32,000,032,040 bytes ...
14-Dec 12:58 bup-serv-sd JobId 93518: Despooling elapsed time = 00:14:45, 
Transfer rate = 36.15 M bytes/second
14-Dec 12:58 bup-serv-sd JobId 93518: Spooling data again ...
14-Dec 13:22 bup-serv-sd JobId 93518: User specified spool size 
[END OF FILE]

During this time, 17 spool sessions of all 7 systems, each 32 GB, have 
been written to tape, not just 2 spools of one system, as logged here.

"status dir" tells me that this one job is terminated, while the other
6 jobs are still marked as running:

 93518 Full    lnv-102.2010-12-14_09.44.00 has terminated

But, nevertheless, the last spool file (25GB) of this job is still
present in spool directory:

-rw-r----- 1 root bacula 25590119686 2010-12-14 14:48 
bup-serv-sd.data.93518.lnv-102.2010-12-14_09.44.00.Tape1Left.spool

And the spool metadata file has just 100MB:

-rw-r----- 1 root bacula 103063552 2010-12-14 14:48 
bup-serv-sd.attr.lnv-102.2010-12-14_09.44.00.5.spool

bacula-sd helds these two files open, as lsof shows:

bacula-sd 2150 root   13u   REG  251,5 25590119686     13 
/var/services/bacula/spool1/bup-serv-sd.data.93518.lnv-102.2010-12-14_09.44.00.Tape1Left.spool
bacula-sd 2150 root   14u   REG  251,1   103063552 131160 
/var/services/bacula/work/bup-serv-sd.attr.lnv-102.2010-12-14_09.44.00.5.spool

but no more connection to the FD of that system.  This is ok, since all
data has been spooled.
For the 6 other jobs, connections AND spool files are present.  Correct.


Spooling of lnv-102 is terminated, but the last spool isn't yet written
to tape.   Then, t here should be NO HEAVY mysql operations yet, since no 
job is completely despooled yet.  Right?

Where comes the mysql load from?  Reminder that the bunch of tape-fitting
jobs wrote 360GB in a few hours, inclusively DB operations.

???


Next, I will stop the slow processing, purge all sick data.  
Then start just one backup job which doesn't fit on one tape, 
and I will look what happens.  I will try to look the state before 
and after the tape change.



Regards,

        Robert

----------------------------------------------------------------
+++    German Research Center for Artificial Intelligence    +++
----------------------------------------------------------------
Dipl.-Inform. Robert V. Wirth, Campus D3_2, D-66123 Saarbruecken
@office: +49-681-85775-5078 / -5572 +++ @fax: +49-681-85775-5020
mailto:Robert.Wirth AT dfki DOT de ++++++++++ http://www.dfki.de/~wirth
----------------------------------------------------------------



------------------------------------------------------------------------------
Lotusphere 2011
Register now for Lotusphere 2011 and learn how
to connect the dots, take your collaborative environment
to the next level, and enter the era of Social Business.
http://p.sf.net/sfu/lotusphere-d2d
_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users