Subject: Re: [Bacula-users] Hung on "Dir inserting attributes"
From: Marc Cousin <mcousin AT sigma DOT fr>
To: Jason Dixon <jdixon AT omniti DOT com>
Date: Fri, 14 Nov 2008 09:24:42 +0100
On Thursday 13 November 2008 23:39:19, Robert Treat wrote:
> Jason Dixon wrote:
> > Date: Thu, 13 Nov 2008 11:44:28 +0100
> > From: Marc Cousin <mcousin AT sigma DOT fr>
> > Subject: Re: [Bacula-users] Hung on "Dir inserting attributes"
> > To: bacula-users AT lists.sourceforge DOT net
> > Cc: Jason Dixon <jdixon AT omniti DOT com>
> >
> >> On Wednesday 12 November 2008 16:39:00, Jason Dixon wrote:
> >> On Wed, Nov 12, 2008 at 07:07:13AM -0800, Dan Langille wrote:
> >>> On Nov 11, 2008, at 2:32 PM, Jason Dixon wrote:
> >>>> We have a new Bacula server (2.4.2 on Solaris 10 x86) that runs fine
> >>>> for most backup jobs.  However, we've encountered a particular job
> >>>> that hangs indefinitely with the status "Dir inserting attributes". 
> >>>> It's important to note that all of our other jobs complete
> >>>> successfully except this one.
> >>>
> >>> What others have said confirmed what I thought.  I too think this job
> >>> is not hanging.  It is inserting.
> >>
> >> For two days?  This isn't a slow box.  Here is the pg_stat_activity we
> >> gathered yesterday.  According to our DBA, it was deadlocked as
> >> exhibited by pids 6192 and 21732 both having ExclusiveLock's.
> >>
> >> http://pastebin.com/f26d96438
> >>
> >> Thanks,
> >
> > I'd first like to explain one thing:
> >
> > PostgreSQL (like most RDBMSs) has a deadlock detection algorithm, so if
> > you really had a deadlock, you'd know: one of the deadlocked processes
> > would be killed.
>
> yep
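
(As an aside, for anyone who hasn't seen it happen: when PostgreSQL does detect
a real deadlock, one of the sessions gets an error almost immediately, it does
not sit there for days. A minimal illustration, with a hypothetical table t:

  -- session 1
  BEGIN;
  UPDATE t SET x = 0 WHERE id = 1;
  -- session 2
  BEGIN;
  UPDATE t SET x = 0 WHERE id = 2;
  UPDATE t SET x = 0 WHERE id = 1;   -- blocks, waiting on session 1
  -- session 1
  UPDATE t SET x = 0 WHERE id = 2;   -- deadlock: one session is aborted with
                                     -- ERROR:  deadlock detected

So a job stuck for two days is waiting or working, not deadlocked.)
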
>
> > On to your problem :
> >
> > Nothing in the pastebin shows a locked process.
>
> true
>
> > To have this information, first, we need the waiting column from
> > pg_stat_activity. As long as it is false, the session has all its locks
> > granted.
>
> yep
>
> > If the column is 'true' for a statement, then we can dig into pg_locks,
> > but for this we'll need more columns too (at least the granted column,
> > to know whether the lock is granted or not), though all the columns of
> > pg_locks are better (they tell what the lock is on, who is waiting for
> > whom, etc.).
>
> yep
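
(Something along these lines should show it, assuming the 8.x catalog column
names, i.e. procpid in pg_stat_activity and pid in pg_locks:

  -- sessions currently blocked waiting for a lock
  SELECT procpid, usename, waiting, current_query
    FROM pg_stat_activity
   WHERE waiting;

  -- every lock held or requested, per backend
  SELECT locktype, relation::regclass, pid, mode, granted
    FROM pg_locks
   ORDER BY granted, pid;

If no session has waiting = true, or every row of pg_locks has granted = true,
then it is not a locking problem.)
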
>
> > It seems to me that the most plausible explanation isn't a lock problem
> > (the whole thing is designed so that inserts into file don't lock), but
> > a tuning problem:
> > - are the statistics up to date? (and are you vacuuming regularly? I
> > hope so, but I'd rather ask)
> > - are shared_buffers and work_mem big enough (how much memory do you
> > have)? This query is supposed to do hash or merge joins between batch, path
> > and filename, which are quite large tables. If it decided to do a nested
> > loop for 15 million rows, it would be catastrophic.
>
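
(On the statistics question above, a quick sanity check, if the server is 8.2
or later, is when the big tables were last vacuumed and analyzed:

  SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
   WHERE relname IN ('file', 'filename', 'path');

NULL or very old timestamps mean the planner is working from stale estimates.)
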
> Jason asked me to take a look at this, and as you were surmising, the
> issue is not a locking issue, but rather one of it just being a really
> slow query. Here is iostat on the relevant disks (which shows them
> somewhat saturated):
>
>      r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
>     86.1   42.0 2203.1  564.9  0.1  0.5    1.2    3.9  13  50 c3t0d0
>     81.1   42.0 2073.0  564.9  0.2  0.5    1.8    4.0  22  49 c3t1d0
>     81.1   42.0 2077.5  564.9  0.2  0.5    2.0    4.2  22  52 c3t2d0
>     76.1   42.0 1947.4  564.9  0.2  0.5    1.6    4.5  16  54 c3t3d0
>     85.1   42.0 2178.6  565.4  0.2  0.5    1.4    4.2  16  54 c3t4d0
>     79.1   42.0 2022.4  565.4  0.2  0.5    1.3    4.3  15  53 c3t5d0
>
> and here is the explain plan, which shows hash joins as you surmised:
>
> bacula=# explain SELECT batch.FileIndex, batch.JobId, Path.PathId,
>          Filename.FilenameId, batch.LStat, batch.MD5
>          FROM pg_temp_2.batch
>          JOIN Path ON (batch.Path = Path.Path)
>          JOIN Filename ON (batch.Name = Filename.Name);
>                                       QUERY PLAN
> ------------------------------------------------------------------------------------------
>  Hash Join  (cost=238431.15..3057911.33 rows=19755330 width=80)
>    Hash Cond: ((batch.path)::text = path.path)
>    ->  Hash Join  (cost=204413.28..1988697.02 rows=19755330 width=108)
>          Hash Cond: ((batch.name)::text = filename.name)
>          ->  Seq Scan on batch  (cost=0.00..600723.30 rows=19755330 width=136)
>          ->  Hash  (cost=96633.57..96633.57 rows=5306057 width=34)
>                ->  Seq Scan on filename  (cost=0.00..96633.57 rows=5306057 width=34)
>    ->  Hash  (cost=16466.72..16466.72 rows=696572 width=79)
>          ->  Seq Scan on path  (cost=0.00..16466.72 rows=696572 width=79)
>
> I didn't spend time trying to tune it further than that (well, I made sure
> filename and path were freshly analyzed), but neither merge joins nor nested
> loops are likely to be of any help here.
>
> One thing I did wonder about was that temp table "batch". Being a temp table
> is good, but does anyone know if the table is analyzed before trying to
> run this insert statement? The explain plan expects 19755330 rows, but a
> count only gives me back 12666152 (granted, that number is probably not
> reliable... and I'm not sure more accurate statistics would help, but
> they probably wouldn't hurt). Additionally, has anyone ever tried adding
> indexes to the temp table to see if it would speed up query execution?

Stats aren't up to date; they aren't being analyzed by Bacula beforehand, but 
the estimates aren't completely wrong either.
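
If someone wants to experiment, the temp table can be analyzed by hand from the
session that owns it before the big insert runs, and indexing it is cheap to
try (the index names below are just examples; whether it helps at all is an
open question):

  ANALYZE batch;
  CREATE INDEX batch_path_idx ON batch (path);
  CREATE INDEX batch_name_idx ON batch (name);

With hash joins over the whole table, though, I wouldn't expect indexes on
batch to change the plan much.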

Seeing the respective sizes of the tables, it seems to me the plan is the 
right one: doing nested loops for 19 (or 12) million rows when filename and 
path are much smaller wouldn't be that smart...

If it is that slow, a few reasons come to mind first:
- not enough work_mem (hashes go there). Here, we have it set to at least 128MB
- not enough shared_buffers? (you need them to write fast :)
- checkpoint_segments too small? (maybe it is checkpointing like crazy)

What are the values of these parameters? How much memory is available on the 
server?
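
From psql, something like this shows what the server is actually running with
(the checkpoint-related setting is called checkpoint_segments on the 8.x
releases):

  SHOW work_mem;
  SHOW shared_buffers;
  SHOW checkpoint_segments;
  SHOW effective_cache_size;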

For now, my best bet would be insufficient work_mem, because filename is quite 
big and it would take a lot of memory to store the whole hash in RAM (it will 
have to go to disk if work_mem is too small).
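
An easy experiment, if the slow statement can be re-run from a session that can
see the batch temp table: raise work_mem for that session only and see whether
the join gets dramatically faster (the 256MB below is just an example value):

  SET work_mem = '256MB';
  EXPLAIN ANALYZE
  SELECT batch.FileIndex, batch.JobId, Path.PathId, Filename.FilenameId,
         batch.LStat, batch.MD5
    FROM pg_temp_2.batch
    JOIN Path ON (batch.Path = Path.Path)
    JOIN Filename ON (batch.Name = Filename.Name);

If it does, the fix is probably just a bigger work_mem for the bacula database
or user.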

