Subject: [Veritas-bu] Checking to see if millions of files are backed up?
From: jpiszcz.backup at gmail.com (Justin Piszcz)
Date: Mon, 26 Mar 2007 17:58:06 -0400
The problem I worry about with running a bplist on each file is the
network overhead and the load it would put on the master server.  If
you have 50 servers with 1,000,000 files each, that would be 50
million network requests in total.  I was thinking: dump the catalog
onto the local machine, build a hash (or emulate one with standard
UNIX utilities in shell), run a find on the filesystem, and then
compare each line of the find output against the hash or data dump of
everything that has been backed up.
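
Roughly, something like this (just a sketch; the file names and paths
below are placeholders):

# Dedupe the locally dumped catalog listing and the find output.
# comm needs both inputs sorted the same way, so pin the locale.
LC_ALL=C sort -u catalog_dump.txt > backed_up.txt
find /data -type f | LC_ALL=C sort -u > on_disk.txt

# comm column 1 is "only in file 1", so -23 leaves exactly the
# paths that are on disk but have no catalog entry.
LC_ALL=C comm -23 on_disk.txt backed_up.txt

comm plays the role of the hash lookup here: two sorted lists can be
compared in a single linear pass without holding either in memory.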

I have an idea that might work.  Suppose:

1. file A holds entries a,b,c (sort | uniq it)
2. file B holds entries c,d,e (sort | uniq it)

Now run:

cat fileA fileB | sort | uniq -c

If a line's count is greater than 1, the entry appeared in both
files, and hence has been backed up.  This has some high local
overhead; however, it may be the fastest solution that avoids hashing
the entire file into memory with perl.
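
With fileA and fileB built from the same two listings as above (names
are again placeholders):

sort -u catalog_dump.txt     > fileA   # what the catalog says is backed up
find /data -type f | sort -u > fileB   # what is actually on disk

# Both inputs are deduped, so a count of 2 can only mean "in both".
cat fileA fileB | sort | uniq -c | awk '$1 > 1'

# uniq -d prints just the duplicated lines and saves parsing the count
# column; uniq -u prints the entries that are in only one list, and any
# find result among those has not been backed up yet.
cat fileA fileB | sort | uniq -d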

What do you think of this solution?  I plan on trying it later today
or tomorrow.

Justin.

On 3/26/07, Darren Dunham <ddunham at taos.com> wrote:
> > If one is to create a script to ensure that the files on the
> > filesystem are backed up before removing them, what is the best
> > data-store model for doing so?
> >
> > Obviously, if you have > 1,000,000 files in the catalog and you need
> > to check each of those, you do not want to bplist -B -C -R 999999
> > /path/to/file/1.txt for each file.  However, you do not want to grep
> > "1" one_gigabyte_catalog.txt either, as there is too much overhead
> > in both cases.
>
> A million is a lot, but with sufficiently large machines, you might be
> able to fit all the names in memory (and if you're really lucky, a perl
> hash).
>
> With a lot of memory, I'd build a name hash from the expected files,
> then run through bplist and verify that every file was in the hash.
>
> When the memory needs of the hash cause this method to break down, you
> can move to alternative databases.  There are several perl modules that
> let you set up a quick database without installing MySQL or Postgres.
> (But you could use those if you had them.)  Then the comparison is
> slower, but much less awful than running a million invocations of
> bplist just to check one file at a time.
>
> --
> Darren Dunham                                           ddunham at taos.com
> Senior Technical Consultant         TAOS            http://www.taos.com/
> Got some Dr Pepper?                           San Francisco, CA bay area
>          < This line left intentionally blank to confuse you. >
> _______________________________________________
> Veritas-bu maillist  -  Veritas-bu at mailman.eng.auburn.edu
> http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu
>