Veritas-bu

[Veritas-bu] Checking to see if millions of files are backed up?

Subject: [Veritas-bu] Checking to see if millions of files are backed up?
From: jpiszcz.backup at gmail.com (Justin Piszcz)
Date: Mon, 26 Mar 2007 18:07:48 -0400
The good: one network connection to pull the data from the master to
the client.  The bad: it will be a lot of data for a larger catalog,
but one can always compress it.

The reason for this is that I am not sure how many of you have used
NetBackup 6.0MPx, or for how long, but early on (this may be better in
6.0MP4), running a lot of commands very quickly or simultaneously
would crash various services on the master.  Otherwise, running bplist
for each file may not be a bad idea.  If it is stable under 6.0, it is
still an option; however, it would be nice to find a fast solution
that does not involve 1,000,000 queries, such as the method I
mentioned.
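
For what it is worth, here is a rough, untested sketch of that
approach (file names are placeholders, and it assumes the list of
backed-up files has already been dumped from the catalog to a local
text file):

  # backedup.list = one backed-up path per line, dumped from the catalog
  sort -u backedup.list > backedup.sorted

  # what is actually on the filesystem right now
  find /data -type f -print | sort -u > ondisk.sorted

  # a count of 2 means the path is in both lists, i.e. it is on disk
  # and has been backed up (the same uniq -c trick quoted below)
  cat backedup.sorted ondisk.sorted | sort | uniq -c | awk '$1 > 1' > backed_up.txt

  # or skip the counting and print the files on disk that are NOT in
  # the catalog dump (comm needs both inputs sorted, which they are)
  comm -23 ondisk.sorted backedup.sorted > not_backed_up.txt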

Justin.

On 3/26/07, Justin Piszcz <jpiszcz.backup at gmail.com> wrote:
> The problem I worry about with running a bplist on each file is the
> network overhead and the load that will hit the master server.  If
> you have 50 servers with 1,000,000 files each, that would be 50
> million network requests in total.  I was thinking: dump the catalog
> onto the local machine, build a hash (or use certain UNIX utilities
> in shell to emulate the concept), then run a find on the filesystem
> and loop through each line of its output, comparing it against the
> hash or data dump of everything that has been backed up.
>
> I have an idea that might work.  Think of:
>
> 1. file A has dirs a,b,c (sort | uniq it)
> 2. file B has dirs c,d,e (sort | uniq it)
>
> Then:
>
> cat fileA fileB | sort | uniq -c
>
> If any line starts with a count > 1, the entry appeared in both files
> and, hence, has been backed up.  This has some (high) local overhead;
> however, it may be the fastest solution that avoids hashing the
> entire file into memory with perl.
>
> What do you think of this solution?  I plan on trying this later today or tomorrow.
>
> Justin.
>
>
>
>
> On 3/26/07, Darren Dunham <ddunham at taos.com> wrote:
> > > If one is to create a script to ensure that the files on the
> > > filesystem are backed up before removing them, what is the best
> > > data-store model for doing so?
> > >
> > > Obviously, if you have > 1,000,000 files in the catalog and you need
> > > to check each of those, you do not want to run bplist -B -C -R 999999
> > > /path/to/file/1.txt for each file.  However, you do not want to run
> > > grep "1" one_gigabyte_catalog.txt for each file either, as there is
> > > too much overhead in both cases.
> >
> > A million is a lot, but with sufficiently large machines, you might be
> > able to fit all the names in memory (and, if you're really lucky, in a
> > perl hash).
> >
> > With a lot of memory, I'd build a name hash from the expected files,
> > then run through bplist and verify that every file was in the hash.
> >
> > When the memory needs of the hash cause this method to break down, you
> > can move to alternative databases.  There are several perl modules that
> > let you set up a quick database without installing MySQL or Postgres
> > (though you could use those if you had them).  Then the comparison is
> > slower, but much less awful than running a million invocations of
> > bplist just to check one file at a time.
> >
> > --
> > Darren Dunham                                           ddunham at taos.com
> > Senior Technical Consultant         TAOS            http://www.taos.com/
> > Got some Dr Pepper?                           San Francisco, CA bay area
> >          < This line left intentionally blank to confuse you. >
> > _______________________________________________
> > Veritas-bu maillist  -  Veritas-bu at mailman.eng.auburn.edu
> > http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
> >
>