Enclosure resets periodically under Linux 4.3.3

Support and Discussion for H82-SU3S2 ProBox - For PC Users

Enclosure resets periodically under Linux 4.3.3

Postby crystalkeeper » January 9th, 2016, 3:44 pm

Trying to resolve this issue and figure out if it's a bad enclosure, bad disk, or bad drivers. If it were a bad drive, what does a disconnect due to a bad drive look like? When my enclosure disconnects, I don't need to power cycle. I can just remount. In my dmesg I get this:

Code: Select all
[183223.326516] usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
[183881.901621] usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
[183939.268812] usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
[186458.911081] usb 2-1: reset SuperSpeed USB device number 2 using xhci_hcd
[187508.192765] usb 2-1: Device not responding to setup address.
[187508.396123] usb 2-1: Device not responding to setup address.
[187508.596369] usb 2-1: device not accepting address 2, error -71
[187512.576871] usb usb2-port1: Cannot enable. Maybe the USB cable is bad?
[187516.557333] usb usb2-port1: Cannot enable. Maybe the USB cable is bad?
[187520.537770] usb usb2-port1: Cannot enable. Maybe the USB cable is bad?
[187520.537985] usb 2-1: USB disconnect, device number 2
[187520.538176] sd 6:0:0:4: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[187520.538185] sd 6:0:0:4: [sde] CDB: Write(10) 2a 00 1e eb a4 00 00 00 80 00
[187520.538191] blk_update_request: I/O error, dev sde, sector 518759424
[187520.538226] sd 6:0:0:4: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[187520.538231] sd 6:0:0:4: [sde] CDB: Write(10) 2a 00 1e eb a4 80 00 00 80 00
[187520.538234] blk_update_request: I/O error, dev sde, sector 518759552
[187520.538254] sd 6:0:0:4: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[187520.538258] sd 6:0:0:4: [sde] CDB: Write(10) 2a 00 1e eb a5 00 00 00 80 00
[187520.538261] blk_update_request: I/O error, dev sde, sector 518759680
[187520.538283] sd 6:0:0:4: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[187520.538288] sd 6:0:0:4: [sde] CDB: Write(10) 2a 00 1e eb a5 80 00 00 80 00
[187520.538292] blk_update_request: I/O error, dev sde, sector 518759808
[187520.538316] sd 6:0:0:4: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[187520.538323] sd 6:0:0:4: [sde] CDB: Write(10) 2a 00 1e eb a6 00 00 00 80 00
[187520.538328] blk_update_request: I/O error, dev sde, sector 518759936
[187520.538352] sd 6:0:0:4: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[187520.538358] sd 6:0:0:4: [sde] CDB: Write(10) 2a 00 1e eb a6 80 00 00 80 00
[187520.538363] blk_update_request: I/O error, dev sde, sector 518760064
[187520.538396] sd 6:0:0:4: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[187520.538402] sd 6:0:0:4: [sde] CDB: Write(10) 2a 00 1e eb 9a 80 00 00 80 00
[187520.538406] blk_update_request: I/O error, dev sde, sector 518756992
[187520.538431] sd 6:0:0:4: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[187520.538438] sd 6:0:0:4: [sde] CDB: Write(10) 2a 00 1e eb 9a 00 00 00 80 00
[187520.538442] blk_update_request: I/O error, dev sde, sector 518756864
[187520.538475] sd 6:0:0:4: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[187520.538482] sd 6:0:0:4: [sde] CDB: Write(10) 2a 00 1e eb 99 80 00 00 80 00
[187520.538486] blk_update_request: I/O error, dev sde, sector 518756736
[187520.538510] sd 6:0:0:4: [sde] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[187520.538516] sd 6:0:0:4: [sde] CDB: Write(10) 2a 00 1e eb 99 00 00 00 80 00
[187520.538521] blk_update_request: I/O error, dev sde, sector 518756608
[187520.545016] scsi 6:0:0:2: rejecting I/O to offline device
[187520.545019] scsi 6:0:0:2: [sdc] killing request
[187520.545021] scsi 6:0:0:2: rejecting I/O to offline device
[187520.545022] scsi 6:0:0:2: [sdc] killing request
[187520.545024] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545031] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545035] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545038] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545041] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545045] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545048] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545052] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545055] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545058] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545061] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545066] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545069] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545072] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545076] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545079] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545082] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545085] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545090] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545093] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545097] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545101] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545104] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545109] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545112] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545117] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545120] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545123] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545128] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545130] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545134] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545137] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545141] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545144] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545148] scsi 6:0:0:2: rejecting I/O to dead device
[187520.545152] scsi 6:0:0:2: rejecting I/O to dead device
[187537.509513] usb 2-1: new SuperSpeed USB device number 3 using xhci_hcd
[187537.526227] usb 2-1: New USB device found, idVendor=152d, idProduct=0567
[187537.526234] usb 2-1: New USB device strings: Mfr=10, Product=11, SerialNumber=5
[187537.526239] usb 2-1: Product: USB to ATA/ATAPI Bridge
[187537.526243] usb 2-1: Manufacturer: JMicron
[187537.526246] usb 2-1: SerialNumber: 152D00539000
[187537.527426] usb-storage 2-1:1.0: USB Mass Storage device detected
[187537.527634] usb-storage 2-1:1.0: Quirks match for vid 152d pid 0567: 4000000
[187537.527717] scsi host9: usb-storage 2-1:1.0
[187538.525299] scsi 9:0:0:0: Direct-Access     WL1000GS A6472            0125 PQ: 0 ANSI: 6
[187538.525898] scsi 9:0:0:1: Direct-Access     WL1000GS A6472            0125 PQ: 0 ANSI: 6
[187538.526413] scsi 9:0:0:2: Direct-Access     WL1000GS A6472            0125 PQ: 0 ANSI: 6
[187538.526919] scsi 9:0:0:3: Direct-Access     WL1000GS A6472            0125 PQ: 0 ANSI: 6
[187538.527318] scsi 9:0:0:4: Direct-Access     WL1000GS A6472            0125 PQ: 0 ANSI: 6
[187538.527631] scsi 9:0:0:5: Direct-Access     WL1000GS A6472            0125 PQ: 0 ANSI: 6
[187538.528140] scsi 9:0:0:6: Direct-Access     WL1000GS A6472            0125 PQ: 0 ANSI: 6
[187538.529008] sd 9:0:0:0: Attached scsi generic sg0 type 0
[187538.529048] sd 9:0:0:0: [sdh] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
[187538.529214] sd 9:0:0:0: [sdh] Write Protect is off
[187538.529226] sd 9:0:0:0: [sdh] Mode Sense: 67 00 10 08
[187538.529392] sd 9:0:0:0: [sdh] No Caching mode page found
[187538.529400] sd 9:0:0:0: [sdh] Assuming drive cache: write through
[187538.529732] sd 9:0:0:1: Attached scsi generic sg1 type 0
[187538.529857] sd 9:0:0:1: [sdi] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
[187538.530080] sd 9:0:0:1: [sdi] Write Protect is off
[187538.530090] sd 9:0:0:1: [sdi] Mode Sense: 67 00 10 08
[187538.530283] sd 9:0:0:2: Attached scsi generic sg2 type 0
[187538.530337] sd 9:0:0:1: [sdi] No Caching mode page found
[187538.530345] sd 9:0:0:1: [sdi] Assuming drive cache: write through
[187538.530708] sd 9:0:0:2: [sdj] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
[187538.530840] sd 9:0:0:3: Attached scsi generic sg3 type 0
[187538.531136] sd 9:0:0:2: [sdj] Write Protect is off
[187538.531143] sd 9:0:0:2: [sdj] Mode Sense: 67 00 10 08
[187538.531323] sd 9:0:0:4: Attached scsi generic sg4 type 0
[187538.532021] sd 9:0:0:5: Attached scsi generic sg5 type 0
[187538.532622] sd 9:0:0:6: Attached scsi generic sg6 type 0
[187538.555818]  sdh: sdh1
[187538.556208] sd 9:0:0:3: [sdk] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
[187538.556325] sd 9:0:0:2: [sdj] No Caching mode page found
[187538.556331] sd 9:0:0:2: [sdj] Assuming drive cache: write through
[187538.556836] sd 9:0:0:3: [sdk] Write Protect is off
[187538.556842] sd 9:0:0:3: [sdk] Mode Sense: 67 00 10 08
[187538.557043] sd 9:0:0:4: [sdl] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
[187538.557157] sd 9:0:0:5: [sdm] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
[187538.557354] sd 9:0:0:6: [sdn] 1953525168 512-byte logical blocks: (1.00 TB/931 GiB)
[187538.557670] sd 9:0:0:3: [sdk] No Caching mode page found
[187538.557675] sd 9:0:0:3: [sdk] Assuming drive cache: write through
[187538.557891] sd 9:0:0:4: [sdl] Write Protect is off
[187538.557899] sd 9:0:0:4: [sdl] Mode Sense: 67 00 10 08
[187538.558029] sd 9:0:0:5: [sdm] Write Protect is off
[187538.558033] sd 9:0:0:5: [sdm] Mode Sense: 67 00 10 08
[187538.558239] sd 9:0:0:6: [sdn] Write Protect is off
[187538.558251] sd 9:0:0:6: [sdn] Mode Sense: 67 00 10 08
[187538.576422]  sdi: sdi1
[187538.576940] sd 9:0:0:4: [sdl] No Caching mode page found
[187538.576948] sd 9:0:0:4: [sdl] Assuming drive cache: write through
[187538.577125] sd 9:0:0:5: [sdm] No Caching mode page found
[187538.577135] sd 9:0:0:5: [sdm] Assuming drive cache: write through
[187538.577273] sd 9:0:0:6: [sdn] No Caching mode page found
[187538.577282] sd 9:0:0:6: [sdn] Assuming drive cache: write through
[187538.578194] sd 9:0:0:0: [sdh] Attached SCSI disk
[187539.123674]  sdj: sdj1
[187539.661491]  sdk: sdk1
[187539.662353] sd 9:0:0:1: [sdi] Attached SCSI disk
[187540.173077]  sdl: sdl1
[187540.198738]  sdm: sdm1
[187540.732171]  sdn: sdn1
[187540.750966] sd 9:0:0:2: [sdj] Attached SCSI disk
[187540.752489] sd 9:0:0:3: [sdk] Attached SCSI disk
[187540.753531] sd 9:0:0:4: [sdl] Attached SCSI disk
[187540.789842] sd 9:0:0:5: [sdm] Attached SCSI disk
[187540.790142] sd 9:0:0:6: [sdn] Attached SCSI disk
crystalkeeper
 
Posts: 5
Joined: January 9th, 2016, 3:39 pm

Re: Enclosure resets periodically under Linux 4.3.3

Postby IceNine » January 9th, 2016, 6:01 pm

Hello Crystalkeeper,

This does look like a garden-variety disconnect.

Whenever this has happened to me in the past, it's always been the result of a failing drive. What seemed to happen - at least in my case - is when a bad block was attempted to be read or written, the drive focuses all of its efforts on retrying to read or write that block, which seems to cause a log-jam somehow. What appears to happen (I'm guessing partly) is that this causes a timeout of some sort in the jmicron USB to SATA translation process that causes it to "reset" the entire process, forcing the drives to disconnect while doing so.

My first step in troubleshooting this would be to install the Smartmon tools package on your linux distro (on debian/ubuntu, it's as simple as running 'sudo apt-get install smartmontools')

Second, I would use the package's 'smartctl' tool to perform a self-test on ALL drives in your mediasonic enclosure. Now, the jmicron chip that does the SATA translation requires a separate command line parameter to properly communicate with the drive '-d sat'. Make sure you use it. You also want to enable offline data collection and all smart-related features for your drives.

Next, you should make a note of which devices are in your enclosure (/dev/sdc, /dev/sdx, etc, or use the devices listed in /dev/disk/by-id, or /dev/disk/by-path, whatever you are most comfortable with).

Then, for each of those drives, enable all smart features (using /dev/sdx as an example):

sudo smartctl -d sat -s on -o on -S on /dev/sdx

Perform this command for all drives in your enclosure. In fact, I would perform it for all drives in your system.

Next, I would perform a long self-test on each drive in your enclosure. If a self-test fails in the middle of a test, it's fine because you already enabled offline data collection in the previous step, which means you will still be able to see the failure in the devices logs later via the smartctl command (more on that later).

Run the command below to initiate a long self-test on that particular drive (again using /dev/sdx as an example):

sudo smartctl -d sat -t long /dev/sdx

Repeat the above command with all other devices in your enclosure. You may also use the devices in /dev/disk/by-id, by-path, etc if you wish. The self-test will take some time to complete. You don't have to wait for one drive to complete the selftest to start the others. You can run the command consecutively for all drives in the enclosure, and they will all do their self-test thing.

To check the progress, or see the results of the self-test, run the following command (again, /dev/sdx being an example):

smartctl -d sat -a /dev/sdx

You should see a lot of statistics generated here. The important things to look for are obviously failed self-tests, as that shows you have an unhealthy drive.

The most damning evidence of a failing hard drive would be the "Reallocated_Sector_Ct" attribute. Every SATA disk you buy has a set of sectors allocated as "reallocation" sectors, in case a drive runs into a bad sector that it can no longer read or write. When your drive encounters one, it takes one of these spare sectors and allocates it in place of the "bad" one. Typically, once a drive starts making these allocations, a drive is on its way out. If the "RAW_VALUE" is anything but 0, you may have a drive failing on you.

If you need any help deciphering the smartctl output for your drives, feel free to post them here and I will be happy to help you decipher them.
IceNine
 
Posts: 3
Joined: October 11th, 2014, 9:56 pm

Re: Enclosure resets periodically under Linux 4.3.3

Postby crystalkeeper » January 10th, 2016, 7:32 pm

Thanks so much for the reply IceNine!

Your theory on the intermittent error makes sense and seems plausible.

I did a long self-check on all connected disks and looked at the Reallocated_Sector_Ct variable but there were no errors and it's 0 for all of my disks. It seems like the root cause may be a little bit more elusive.

Two of the disks have a non-zero Raw_Read_Error_Rate, one is "1" and the other is "35." I'll look into this variable more to see what others have said. Do you know anything about it?
crystalkeeper
 
Posts: 5
Joined: January 9th, 2016, 3:39 pm

Re: Enclosure resets periodically under Linux 4.3.3

Postby IceNine » January 11th, 2016, 3:04 pm

https://en.wikipedia.org/wiki/S.M.A.R.T ... attributes

That Attribute is mostly meaningless from what the wiki states (Note attributes in Pink). I would ignore it and look at the others. Also, I would try another cable just in case. Did you have to install any USB3 drivers under Linux, or did the included XHCI support work fine?
IceNine
 
Posts: 3
Joined: October 11th, 2014, 9:56 pm

Re: Enclosure resets periodically under Linux 4.3.3

Postby crystalkeeper » January 13th, 2016, 11:19 am

Okay, thanks for pointing me to that page. One of my drives seems to report a different firmware than the rest and coincidentally seems to have a bad raw value for an attribute that is supposed to indicate drive failure, "Command_Timeout":

Code: Select all
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   142   137   021    Pre-fail  Always       -       3891
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       62
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11517
 10 Spin_Retry_Count        0x0033   100   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       30
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   096   061   000    Old_age   Always       -       4295032977
190 Airflow_Temperature_Cel 0x0022   076   058   000    Old_age   Always       -       24
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       16
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       45
194 Temperature_Celsius     0x0022   119   101   000    Old_age   Always       -       24
195 Hardware_ECC_Recovered  0x0036   200   200   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   199   000    Old_age   Always       -       140
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       2


The interesting part is that none of the other drives report that variable, only this one. All the drives are the same manufacturer/make, except this one happens to have a different firmware. I wonder, if any of the other drives were to have the same firmware, would they show a bad "Command_Timeout" value as well?

I will replace this drive and cross my fingers for no more disconnects (had another one this morning :evil:). BTW, have you experienced an issue due to "Command_Timeout" in the past?
crystalkeeper
 
Posts: 5
Joined: January 9th, 2016, 3:39 pm

Re: Enclosure resets periodically under Linux 4.3.3

Postby crystalkeeper » February 9th, 2016, 8:04 pm

Wanted to give an update in case anyone comes across this in the future.

I replaced the drive with the large "Command_Timeout." That helped things somewhat but I soon started experiencing resets again.

Through another forum I found out that Linux itself will instigate a reset of the hardware if a certain timeout is hit:

Code: Select all
/sys/block/$DEV/device/timeout


This value was originally set to 30 seconds for all drives in my enclosure.

I also learned that drive firmware has an internal timeout that can sometimes be controlled via SMART, it's called scterc. This timeout is triggered when a drive is attempting to automatically correct a failed read. If the timeout hits, the drive returns read failure on the block. It's important that this timeout is less than or equal to the Linux timeout.

None of my drives support controlling their ERC timeout through SMART. I read that a typical timeout in that case is about 3 minutes. I just bumped it up to 300 seconds on all my drives. The resets have definitely started happening less frequently. The best part is that now some of my drives have started to visibly show errors through SMART.

At this point it seems two of my drives are experiencing intermittent failures. This is good because now I know which ones to replace. All of these drives were pretty cheap so in the future I will likely get more high quality drives.

I'll keep this thread up to date as I continue maintenance on this array and learn new things.
crystalkeeper
 
Posts: 5
Joined: January 9th, 2016, 3:39 pm


Return to H82-SU3S2 ProBox - For PC Users

Who is online

Users browsing this forum: No registered users and 1 guest