fsck caused .BFS files to get moved to lost+found. How to find the file names?

alexp36

ADSM.ORG Member
Joined
Jun 14, 2018
Messages
15
Reaction score
0
Points
0
Need some help recovering from an AIX filesystem issue.
We unmounted the filesystem where our disk storage pool sits.
When trying to re-mount the filesystem, it wouldn't mount, and needed us to run an fsck.

After the fsck, all the TSM disk volume files are gone, and it looks like they've been moved to the lost+found directory.
All the file names have been changed to 2-3 digit numbers, like "65", or "100".

So, my understanding is we should be able to just rename them, and move them back out of the lost+found directory.
But, there are around 140 of them, and no way of knowing which lost file was which .BFS file.

Is there any way of reading the .BFS format at the AIX level? Any way of finding out what the filenames were?
We have a call open with IBM about this also, but so far they haven't been very helpful.

Thanks for any help or suggestions.
 
Likely not what you are looking for, but I would look at restoring your storage pool from the copy pool.
lost+found is exactly that. It's where fsck puts files or fragments of files that it has no idea where they belong.
The numbers you are seeing should represent the inode in which those files/fragments lived.

If IBM could somehow read the headers of the file-based backup files, they might be able to determine which was which. As far as I know, working that out with the standard methods (strings, file, od, lquerypv) is next to impossible.
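
For what it's worth, a rough sketch of that kind of low-level poking around (the mount point and LV name below are made up, adjust for your environment). In my experience none of it reliably tells you which TSM volume a given lost+found file was:

cd /tsm/diskpool/lost+found       # hypothetical filesystem mount point
ls -li                            # the file names should match the inode numbers listed here
file 65                           # usually just reports "data" for a TSM disk volume
od -c 65 | head                   # dump the first few bytes of the header
strings 65 | head -20             # look for any readable text near the start
lquerypv -h /dev/tsmdisklv        # hex dump of the start of the underlying LV (hypothetical LV name)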

I'm assuming the .bfs files are file-based volumes. To see what is damaged, I think you'd need to run an audit.
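
If it does come to an audit, something along these lines from the admin command line is the usual shape of it (the volume path and credentials are placeholders). fix=no only reports the damage; fix=yes removes references to damaged files from the database, so only run it once you're happy with the report:

dsmadmc -id=admin -password=XXXXX "audit volume /tsm/diskpool/vol001.bfs fix=no"
dsmadmc -id=admin -password=XXXXX "audit volume /tsm/diskpool/vol001.bfs fix=yes"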

Since you said you've engaged with IBM support, I'd lean on their expertise.

I'd also be concerned about what caused the inconsistencies. Something such as https://www.ibm.com/support/pages/apar/IJ21577 could be a cause (just the first APAR that came to mind). Once you've identified what caused the inconsistencies, effort should be made to correct it.
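
If you want to start narrowing the cause down yourself, the usual quick checks on AIX look something like this (the APAR is just the one linked above, the disk name is a placeholder):

instfix -ik IJ21577       # is the fix for that APAR already installed?
oslevel -s                # current TL/SP level, to compare against the APAR's fixing level
errpt | head -40          # recent error log entries - look for disk/SAN/LVM errors
errpt -a -N hdisk4        # full detail for errors logged against a specific disk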

In my short time as an AIX admin (8 years now) I've only had four events where I had to fsck a JFS2 filesystem. Three were on redundant VIOS after running an updateios or upgradeios command. One was on a production HA cluster where we actually had a SAN storage issue and had to restore data from backup. Long story short, the old NetApp had WAFL issues and the whole pool was lost.
 
Thanks for that, your ideas all tally with what we have found. Unfortunately no copy pool in this instance, for, ahh, "reasons". My colleague has worked through it with IBM and managed to get the volumes back online now.

We pretty well know what caused the inconsistencies - there was a SAN disk issue about a week ago, which took out the disk for about 6 hours.
Oddly, there had been an automatic filesystem recovery by AIX after the SAN issue was sorted last week, and all appeared to be okay after that.

It wasn't until we unmounted the filesystem for an unrelated reason, and tried to re-mount that we found we had problems.

I've had to run fscks many times (around 20 years in AIX :) ), and never once actually "lost" any files.
I was fairly well stumped initially. fsck appeared to complete successfully, great, filesystem mounted okay, awesome, and then... no files. Uh oh.

I'm actually on holiday right now, so I didn't get too involved, but I'll find out in detail what the procedure was when I'm back on Monday, and post some info here.

Cheers.
 
Ahh, love "reasons".
SAN issues are a pain sometimes. And yeah, the only time fsck didn't work for me was when the underlying storage was just too messed up.
Glad everything got sorted.

Enjoy your holiday! Stop thinking about work.
 
So the fix in our case was to log in to DB2 and go through the logs, finding the most recent modification time of each file volume.
That could then be compared with the mod times in the lost+found directory to piece together which file was which. Apparently two of them had exactly the same mod time, down to the second, so I believe they just had to guess those and hope it was right. Which it was.
As mentioned, I wasn't working, but I believe it was quite a time-consuming process, so the server was out of action for over 24 hours in the end.
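
For anyone who hits this later, the general idea looks something like the sketch below. The SQL column names, storage pool name and paths are my guesses at what was used rather than the exact procedure, so check them against your TSM level before relying on them:

# 1. Last write time per disk pool volume, from the TSM server
dsmadmc -id=admin -password=XXXXX -dataonly=yes \
  "select VOLUME_NAME, LAST_WRITE_DATE from VOLUMES where STGPOOL_NAME='DISKPOOL'" \
  > /tmp/volume_times.txt

# 2. Modification time of each orphaned file fsck left behind
cd /tsm/diskpool/lost+found       # hypothetical mount point
for f in *
do
  echo "$f: $(istat "$f" | grep 'Last modified')"
done > /tmp/lostfound_times.txt

# 3. Pair each lost+found mtime with the closest LAST_WRITE_DATE, mv the file back to
#    its original volume name, then audit the volumes before trusting them again.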
 