Subject: Re: [Networker] 7.2 upgrades...
From: Anacreo <anacreo AT GMAIL DOT COM>
To: NETWORKER AT LISTSERV.TEMPLE DOT EDU
Date: Sun, 14 Mar 2010 23:20:17 -0500

Hmm, your situation has some uniqueness to it. I'd keep harping on isolating
each component. I'm curious what your total throughput is between your two
servers; have you tried simply doing a few parallel FTP runs between the two
servers and getting some timings off of that? What's sticking in my head is
that 60MB/s is basically half of a gigabit pipe, which could be tied to some
sort of Ethernet issue through your core, etc.
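
Something like this would at least give you a raw number for the wire, with
NetWorker and the tape drives out of the picture (just a rough sketch;
"thumper" and the file path are placeholders for your hosts):

# Push 1GB of zeros over the network, no disk involved; divide 1024MB by the
# elapsed time, and run two or three in parallel to see whether the
# aggregate scales past ~60MB/s:
time dd if=/dev/zero bs=1024k count=1024 | ssh thumper 'cat > /dev/null'

# Pull from a real file on the Thumper to include its disk reads:
time ssh thumper 'dd if=/aftd/somefile bs=1024k count=1024' > /dev/null

# While a test runs, watch whether traffic spreads across the aggregated
# links or piles onto one of them:
netstat -i -I e1000g0 1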

I'm now reading up on the ASYNCH_IO issues, where NetWorker states that
async I/O is not available on Solaris 10 and that there will be a
performance degradation for I/O-intensive operations such as cloning. It
looks like Solaris 10 does support async I/O, but it's actually handled now
as a user thread instead of a kernel thread unless it's a raw device.
Combine this with the fact that the T1000 has a slower processor
(1.0-1.2 GHz), and the effects could be compounded.
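
One quick way to see whether one of those slow cores is the choke point
during a clone would be something like this (just a sketch with stock
Solaris tools; which NetWorker process to watch is a guess on my part):

# Per-thread microstate accounting; look for a single LWP (e.g. under
# nsrmmd) pinned near 100% USR+SYS while the clone runs:
prstat -mL 1

# Per-CPU view; on a T1000/T2000 one saturated hardware thread shows up as
# one busy CPU among a lot of idle ones:
mpstat 1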

Also, is your T1000 at update 8 as well? There were a lot of "auto-tuning"
network features that were tweaked for update 8...

Ok I'm out of ideas, but really curious if you do solve this issue.  Good
luck!

Alec


On Sun, Mar 14, 2010 at 6:59 PM, Yaron Zabary <yaron AT aristo.tau.ac DOT
il> wrote:

>
>  The single-thread issue was supposed to be fixed with U6 (I am at U8), so
> I hope this is not the problem. Anyhow, I don't have problems getting
> 100MBps when writing, so I guess I should be OK with reading as well (with
> respect to the CPU calculation of sha256 checksums).
>
>  But keep on sending those ideas; I have been trying to figure this out for
> a couple of months without much success.
>
>
> Anacreo wrote:
>
>> In either case, please read below. I've seen the effects of this first
>> hand, and it is easy to see whether it's causing your performance
>> degradation:
>>
>> From:
>> http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
>>
>> Tuning ZFS Checksums
>>
>> End-to-end checksumming is one of the great features of ZFS. It allows ZFS
>> to detect and correct many kinds of errors other products can't detect and
>> correct. Disabling checksum is, of course, a very bad idea. Having file
>> system level checksums enabled can alleviate the need to have application
>> level checksums enabled. In this case, using the ZFS checksum becomes a
>> performance enabler.
>>
>> The checksums are computed asynchronously to most application processing
>> and should normally not be an issue. However, each pool currently has a
>> single thread computing the checksums (RFE below) and it's possible for
>> that computation to limit pool throughput. So, if disk count is very large
>> (>> 10) or single CPU is weak (< Ghz), then this tuning might help. If a
>> system is close to CPU saturated, the checksum computations might become
>> noticeable. In those cases, do a run with checksums off to verify if
>> checksum calculation is a problem.
>>
>> If you tune this parameter, please reference this URL in shell script or
>> in an /etc/system comment.
>>
>>
>> http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Tuning_ZFS_Checksums
>>
>> Verify the type of checksum used:
>>
>> zfs get checksum <filesystem>
>>
>> Tuning is achieved dynamically by using:
>>
>> zfs set checksum=off <filesystem>
>>
>> And reverted:
>>
>> zfs set checksum='on | fletcher2 | fletcher4 | sha256' <filesystem>
>>
>> Fletcher2 checksum (the default) has been observed to consume roughly 1Ghz
>> of a CPU when checksumming 500 MByte per second.
>>
>> On Sun, Mar 14, 2010 at 6:46 PM, Anacreo <anacreo AT gmail DOT com> wrote:
>>
>>>  OK, so how are you accessing the Thumper: as an adv_file over NFS or as
>>> an iSCSI LUN?
>>>
>>> Have you been able to clock your read speed off of the Thumper through to
>>> the T1000?  If you can write through at 100MB/s, can you read at least
>>> that speed over X number of connections, where X is the number of devices
>>> you're trying to simultaneously clone to?
>>>
>>> Alec
>>>
>>> On Sun, Mar 14, 2010 at 6:30 PM, Yaron Zabary <yaron AT aristo.tau.ac DOT
>>> il> wrote:
>>>
>>>
>>>> Anacreo wrote:
>>>>
>>>>>  Yaron,
>>>>>  What version of Solaris are you running on the Thumper? Update 8 is
>>>>> significantly faster than, say, update 3.
>>>>>
>>>>   The Thumper is U8 with recommended patches from November 2009 (kernel
>>>> is Generic_141445-09).
>>>>
>>>>
>>>>>  Do you have any SSDs in the Thumper to handle L2ARC?
>>>>>
>>>>   No.
>>>>
>>>>
>>>>
>>>>>   What kind of performance are you getting?
>>>>>
>>>>   As I said the problem is when staging from an AFTD on the Thumper to
>>>> an LTO4 drive (with LTO3 media) on the T1000. I can get ~30MBps per
>>>> clone session. If I run a few of them (there are four drives on the
>>>> T1000), the total will be ~60MBps. Staging from an AFTD which is local
>>>> to the T1000 can do ~70MBps. The Thumper and the T1000 are both
>>>> connected via a 4 port aggregate (dladm create-aggr -d e1000g0 -d e1000g1 -d e1000g2 -d e1000g3 1)
>>>> to the same Cisco 3560 switch, so, in theory, if I get unlucky and all
>>>> sessions hit the same interface, the network should limit me to 125MBps.
>>>>
>>>>
>>>>
>>>>>   Have you tried a few tests, like backing up /dev/random, to see where
>>>>> you're bottlenecking?
>>>>>
>>>>   All backups go to the Thumper, and with 64 sessions I can get
>>>> ~100MBps, which is OK because all clients are connected via a single
>>>> 1GigE link of the above-mentioned 3560 at the campus. So, there is no
>>>> performance problem with the Thumper when doing backups. Also, zpool
>>>> iostat 1 does not show any heavy load on the pool (I cannot send any
>>>> output because there is a scrub running right now).
>>>>
>>>>
>>>>  Feel free to make more suggestions.
>>>>
>>>>
>>>>
>>>>
>>>>  Alec
>>>>>
>>>>> On 3/14/10, Yaron Zabary <yaron AT aristo.tau.ac DOT il> wrote:
>>>>>
>>>>>  tkimball wrote:
>>>>>>
>>>>>>> We went from 7.1.3 + AlphaStor to 7.4.4 and no AlphaStor *cheer*.  The
>>>>>>> version choice was made over a year ago, based on Stan's experience
>>>>>>> with it on Sun hardware.
>>>>>>>
>>>>>>> I've been overall pleased with the new version, in particular how much
>>>>>>> easier library management is (compared to AlphaStor anyway).  I'm still
>>>>>>> poking and prodding at the GUI to see how far I can take it, and how to
>>>>>>> document procedures for our Ops group.
>>>>>>>
>>>>>>> Right now my only gripe is that 7.6 came out at the wrong time (final
>>>>>>> eval before rollout); otherwise my NMC server would have been that
>>>>>>> instead of 7.4.4.  Now I'm waiting until at least June before getting
>>>>>>> back something similar to the old nwadmin.
>>>>>>>
>>>>>>> Most of our troubles come from old Windows boxes (W2K Server and
>>>>>>> AdvServer), even before the upgrade, though we've now had one incident
>>>>>>> where the Adv_file devices started unmounting but would not re-mount
>>>>>>> (said it was not in media db!).  Bouncing the software fixed that; it
>>>>>>> had been running for almost a month.
>>>>>>>
>>>>>>> Yaron, can you give details regarding what your DBO issues are?  I've
>>>>>>> not seen any throughput issues (actually that's been better, now that
>>>>>>> the server itself also went from an E450 to a T2000).  However, our
>>>>>>> disk array is 1 Gig FC, so I may not be able to help.
>>>>>>>
>>>>>>   Our setup is an AFTD which is located on a Sun X4500 (Thumper), and
>>>>>> the tape library is connected to a T1000. Staging from the X4500 to the
>>>>>> T1000 is performing poorly compared to the old setup (a Clariion AX150
>>>>>> which was directly connected to the T1000). I suspected that this was
>>>>>> related to LGTsc30475 (aka 30475nw, "Cloning is slow from the local to
>>>>>> remote device"). I was hoping that this would be solved by 7.5.2, but
>>>>>> after upgrading this morning, things are quite the same.
>>>>>>
>>>>>>  The issues we had (on previous versions) were:
>>>>>>
>>>>>>  . "duplicate name; pick new name or delete old one" (on 7.2.2).
>>>>>> Upgraded to 7.3.4.
>>>>>>
>>>>>>  . Owner notification bug (on 7.4.3).
>>>>>>
>>>>>>  . LGTsc24106 (on 7.4.4). (volretent) Patched a few binaries.
>>>>>>
>>>>>>  . Some nsrck bug (on 7.4.5) (nsrck hang on unknown clients unrelated
>>>>>> to AFTD). Patched some binaries.
>>>>>>
>>>>>>  . "Failed to fetch the saveset(ss_t) structure for ssid" (on 7.5.1).
>>>>>> Moved to 7.5.1.7.
>>>>>>
>>>>>>  --TSK
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> evilensky AT gmail DOT com wrote:
>>>>>>>
>>>>>>>  Hi,
>>>>>>>>
>>>>>>>> So what's the latest word on upgrades from 7.2?  Is 7.6 a viable
>>>>>>>> option or is 7.4/7.5 more "fully cooked"?  We're not really looking
>>>>>>>> for features so much as support for the latest client platforms and
>>>>>>>> stability.  We're not going to be spending much money on upgraded
>>>>>>>> hardware either, so an in-place upgrade is the most likely.  Thanks in
>>>>>>>> advance for any opinions/observations.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>  --
>>>>>>
>>>>>> -- Yaron.
>>>>>>
>>>>>>
>>>>>>  --
>>>>
>>>> -- Yaron.
>>>>
>>>>
>>>
>>
>
> --
>
> -- Yaron.
>

To sign off this list, send email to listserv AT listserv.temple DOT edu and 
type "signoff networker" in the body of the email. Please write to 
networker-request AT listserv.temple DOT edu if you have any problems with this 
list. You can access the archives at 
http://listserv.temple.edu/archives/networker.html or
via RSS at http://listserv.temple.edu/cgi-bin/wa?RSS&L=NETWORKER
