Bacula-users

Subject: Re: [Bacula-users] Bacula + Postgres : copy batch problem
From: Rory Campbell-Lange <rory AT campbell-lange DOT net>
To: Marc Cousin <cousinmarc AT gmail DOT com>
Date: Wed, 4 Aug 2010 00:09:53 +0100
On 03/08/10, Marc Cousin (cousinmarc AT gmail DOT com) wrote:
> > > > 3. Why is Bacula using a batch file at all? Why not simply do a straight
> > > >    insert?
> > > 
> > > Because 7,643,966 inserts would be much slower.
> > 
> > Really? I've logged Bacula's performance on the server and the inserts
> > run at around 0.35 ms and updates at around 0.5 ms. 

> What is traced, usually, is execution time. You won't easily get:
> - Parse time of the query. It is basically zero with a batch insert,
>   whereas it is very measurable with individual inserts.
> - Round-trip duration and overhead. This one, even if everything is
>   running on the same machine, is where the cost savings of batch
>   insert are high: if you do everything with inserts, the inserting
>   process has to wait for the database to acknowledge each operation
>   before submitting the next one. And inserting records in Bacula
>   isn't all about inserts. There are some selects too, to look up
>   pathid and filenameid. You also pay a penalty because you send
>   data back to the caller (how many records were inserted and the like).
> 
> To give you a very simplified simulation, I've tried inserting 1 million
> integer values the way the batch insert works (COPY). It takes 3.5
> seconds, mostly I/O bound.
> 
> With individual inserts it takes 77 seconds, mostly CPU bound.
> 
> The gains are lower with Bacula, because the data inserted is more
> complex, Bacula itself is more complex, and there are indexes to
> maintain, but it gives you an idea of why there is a batch mode.
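
For anyone wanting to reproduce that comparison, a minimal sketch in
psql, assuming a hypothetical scratch table t and a pre-generated
one-integer-per-line data file (both names are just illustrative):

     -- hypothetical scratch table for the timing test
     CREATE TABLE t (n integer);

     -- batch path: a single COPY streams every row in one operation
     -- (\copy runs client-side, so no superuser rights are needed)
     \copy t FROM 'million_ints.txt'

     -- row-by-row path: one statement, one acknowledgement per value
     INSERT INTO t VALUES (1);
     INSERT INTO t VALUES (2);
     -- ... and so on: a million round trips in total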

Actually, this is what I don't get. PostgreSQL is a highly scalable,
robust database system, yet here it is being used as a data dump rather
than as a working tool for maintaining a transaction-based catalogue.

Yes, a batch insert is faster than individual inserts, but the latter
should be done at "written-to-tape" time; it could be done
asynchronously, but within a transaction. It's pretty crazy for a >7TB
tape backup to fail because of a temporary file/table problem at the END
of the backup process rather than during it.
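
Something like this per job would do it, sketched against the standard
catalogue columns (the literal values below are only placeholders):

     BEGIN;
     -- one catalogue row per file, written as the file goes to tape
     INSERT INTO File (FileIndex, JobId, PathId, FilenameId, LStat, MD5)
     VALUES (1, 42, 7, 13, '<lstat>', '<digest>');
     -- ... further files for the same job ...
     COMMIT;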

Also, the COPY writes to a temporary table, and then some rather curious
inserts are done into the Bacula tables, e.g.:

     INSERT INTO Path (Path)
     SELECT a.Path
     FROM (
         SELECT DISTINCT Path FROM batch
     ) AS a
     WHERE NOT EXISTS (
         SELECT Path FROM Path WHERE Path = a.Path
     )

This is a kludge (with an inefficient correlated subquery!) that could
easily miss paths which already exist from previous, unrelated backups.
A continuous insert process against a job and mediaid simply wouldn't
need to do this.
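
Even keeping the batch table, the same de-duplication could be written
as an anti-join rather than a correlated subquery (a sketch, not tested
against Bacula itself; recent Postgres planners can plan the NOT EXISTS
form this way on their own):

     INSERT INTO Path (Path)
     SELECT DISTINCT b.Path
     FROM batch b
     LEFT JOIN Path p ON p.Path = b.Path
     WHERE p.Path IS NULL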

More native support for Postgres would also allow, for instance, faster
and more powerful searching of catalogues for restores, rather than the
strange restore procedure required through bconsole.
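
For instance, finding every job that holds a given file is a plain join
across the catalogue tables (column names as in the stock Bacula schema;
the filename here is illustrative):

     SELECT j.JobId, j.Name, j.StartTime, p.Path, f.Name
     FROM File fl
     JOIN Job j ON j.JobId = fl.JobId
     JOIN Path p ON p.PathId = fl.PathId
     JOIN Filename f ON f.FilenameId = fl.FilenameId
     WHERE f.Name = 'important.odt'
     ORDER BY j.StartTime DESC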

I'm delighted to be using Bacula (particularly after our tribulations
with Amanda) but it seems to me that Bacula could lean much more heavily
on Postgresql.

-- 
Rory Campbell-Lange 
rory AT campbell-lange DOT net

_______________________________________________
Bacula-users mailing list
Bacula-users AT lists.sourceforge DOT net
https://lists.sourceforge.net/lists/listinfo/bacula-users