Performance Issue with SCSI-Generic-Attached (TSM Passthru) Tapes on SLES11 SP2

peinkofer

Hi,

Since I did not find anything about this on the internet, I wanted to leave the information here and see whether anyone else has encountered this issue.

We recently upgraded our TSM servers from SLES11 SP1 to SLES11 SP2 (x86_64). Afterwards we saw a major performance problem with our T10K drives,
which are connected to TSM via the Linux SCSI generic (sg) driver. The problem was so severe that a server which previously read/wrote nearly 1 GB/s
from/to tape was limited to approximately 200 MB/s of tape throughput, and tape IO also stalled from time to time. On other systems, which use
IBM tapes via the lin_tape driver, we could not observe this issue.

After some analysis, we found that it was caused by a change in the Linux kernel made in version 2.6.37 (SP1 ships 2.6.32 and SP2 ships 3.0.x).
The problem seems to be that TSM issues its IOs to the sg devices via the ioctl syscall. Prior to kernel 2.6.37, this syscall was protected by the big kernel lock (BKL).
In 2.6.37 the BKL was removed and replaced by a driver-private mutex. However, the semantics of the BKL and a mutex are not the same.
The BKL is released by the scheduler when a process goes to sleep (for example while waiting for IO completion, which the ioctl does) and re-acquired when it wakes up,
giving other processes the chance to acquire the BKL while the first process waits.
A mutex, by contrast, is not released by the scheduler; it is held until the process releases it itself. As a result, only a single IO
can be in flight through the sg driver at any one time, so all IOs passing through the sg driver are serialized. This is especially bad
for long-running commands such as a rewind: no further IO can be issued to any tape for as long as that tape is rewinding.
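To make the difference between the two locking behaviours concrete, here is a small userspace analogy in Python. It is not kernel code: a `threading.Lock` stands in for the global lock, `time.sleep` stands in for waiting on a tape command, and `IO_TIME`, `fake_ioctl` and `elapsed` are hypothetical names chosen for illustration. With the lock dropped while waiting (BKL-like), N concurrent "IOs" finish in roughly one IO time; with the lock held across the wait (mutex-like), they take roughly N IO times.

```python
import threading
import time

IO_TIME = 0.2  # hypothetical duration of one blocking tape command, in seconds


def fake_ioctl(lock, hold_across_wait):
    """Simulate one blocking ioctl that takes a global lock on entry."""
    lock.acquire()
    # (command submission would happen here, under the lock)
    if hold_across_wait:
        # Mutex-like: the lock stays held while we wait for completion.
        time.sleep(IO_TIME)
        lock.release()
    else:
        # BKL-like: the lock is dropped while we sleep...
        lock.release()
        time.sleep(IO_TIME)
        # ...and re-acquired on wakeup before returning.
        lock.acquire()
        lock.release()


def elapsed(n_ios, hold_across_wait):
    """Wall-clock time for n_ios concurrent fake ioctls sharing one lock."""
    lock = threading.Lock()
    threads = [
        threading.Thread(target=fake_ioctl, args=(lock, hold_across_wait))
        for _ in range(n_ios)
    ]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start


if __name__ == "__main__":
    n = 5
    print(f"lock dropped while waiting: {elapsed(n, False):.2f} s")
    print(f"lock held while waiting:    {elapsed(n, True):.2f} s")
```

The serialized case scales linearly with the number of concurrent IOs, which matches the symptom of several drives sharing what is effectively one IO slot.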

You can also see this by watching /proc/scsi/sg/debug: only one IO is ever active at a time.
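If you want to watch this over time instead of eyeballing the file, a small polling script like the following could help. It is a sketch under one loud assumption: that the sg driver marks requests still in flight with `act:` in /proc/scsi/sg/debug (completed ones show `rcv:` or `fin:`). Check that against the file on your own kernel before relying on it. `count_active` and `watch` are hypothetical helper names, and reading the file usually requires root.

```python
import time
from pathlib import Path

SG_DEBUG = Path("/proc/scsi/sg/debug")
# Assumption: the sg driver tags requests still in flight with "act:"
# (completed ones show "rcv:" or "fin:"); verify this on your own kernel.
ACTIVE_MARKER = "act:"


def count_active(snapshot):
    """Count in-flight sg requests in one snapshot of the debug file."""
    return snapshot.count(ACTIVE_MARKER)


def watch(interval=0.5, samples=20):
    """Poll the debug file and report the peak number of active requests."""
    peak = 0
    for _ in range(samples):
        active = count_active(SG_DEBUG.read_text())
        peak = max(peak, active)
        print(f"active sg requests: {active}")
        time.sleep(interval)
    print(f"peak concurrently active: {peak}")
    return peak
```

Call `watch(interval=1.0, samples=60)` while several tape sessions are busy. On an affected kernel the peak should stay at 1 even with multiple drives streaming.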

From what I've seen in the Linux kernel Git repository, all kernels from 2.6.37 to 3.4 have this problem. So of the Linux distributions currently supported by TSM, only SLES11 SP2 is affected,
but I suspect that the upcoming RHEL 6.4 may be affected as well.
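A quick way to check whether a server's base kernel version falls in that mainline range is sketched below. It encodes only the range stated above (2.6.37 up to and including 3.4.x). Distribution kernels often carry backported fixes, so a match means "worth investigating", not "definitely affected". The function names are hypothetical.

```python
import re

# Mainline range reported above as affected: 2.6.37 up to and including 3.4.x.
FIRST_AFFECTED = (2, 6, 37)
FIRST_UNAFFECTED = (3, 5)


def parse_kernel_version(release):
    """Extract (major, minor, patch) from a `uname -r` style string."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch level is treated as 0, e.g. "3.0" -> (3, 0, 0).
    return tuple(int(g) if g is not None else 0 for g in m.groups())


def in_affected_mainline_range(release):
    """True if the base version lies in the reported mainline range.

    Distribution kernels may backport fixes, so True only means
    "worth checking", not "definitely affected".
    """
    return FIRST_AFFECTED <= parse_kernel_version(release) < FIRST_UNAFFECTED
```

On the server itself, `in_affected_mainline_range(platform.release())` (after `import platform`) checks the running kernel.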

I'm currently in contact with IBM and SUSE support about this issue and will keep you informed once I have a solution.

Many Regards,
Stephan Peinkofer
 
Hi, did you receive an answer on this issue? I think I have the same problem!
 
Hi,

SUSE has a PTF for this; just contact SUSE Support. You can reference my SR 10797804871. (The PTF kernel should be 3.0.42-0.7.3.4798.1.PTF.785496.x86_64.)

Many Regards,
Stephan Peinkofer
 