Re: [MT-AU Public] Problems with multiple RBMetal2SHPn devices failing at one site

29 Apr 2014

Hello Andrew,

Thanks for the quick reply.

This unit is one end of a link that forms half of a double link, using four units in total. All four units are configured essentially identically, except for names, keys, IPs and single routes. None of the other three units has ever exhibited any problems at all over the life of the link(s).

Resources are all well within limits. CPU is idling. Device voltage and ambient temperatures are fine. Crashes can be days apart or less than an hour apart, regardless of traffic.

Links are running NV2. No firewall, NAT or filter settings. No services other than SNMP, Winbox and SSH.

I have a simple script running on the device at the problem location which checks the link integrity and cycles the wireless interface if it fails, and then reboots the device if three interface cycles don't get the link back. I use a script because it also logs some stuff before it reboots. That said, we have had at least one instance of logging in to the device after an extended outage to find the script had vanished, but it came back after the reboot :)
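For anyone wanting the logic spelled out, the escalation described above can be sketched roughly as follows. This is an illustrative Python model only, not the RouterOS script running on the device: the names `LinkWatchdog`, `on_check` and `run`, and the action strings, are made up for the example. The limit of three interface cycles is the only detail taken from the description.

```python
from dataclasses import dataclass

# Three interface cycles are tried before rebooting, per the description above.
MAX_CYCLES = 3


@dataclass
class LinkWatchdog:
    """Hypothetical model of the check / cycle / reboot escalation."""
    failed_cycles: int = 0

    def on_check(self, link_ok: bool) -> str:
        """Return the action to take after a single link check."""
        if link_ok:
            # Link is back: forget any earlier failed cycles.
            self.failed_cycles = 0
            return "none"
        if self.failed_cycles < MAX_CYCLES:
            # Link is down and we still have interface cycles left to try.
            self.failed_cycles += 1
            return "cycle-interface"
        # Three cycles did not restore the link: log, then reboot.
        self.failed_cycles = 0
        return "log-and-reboot"


def run(checks):
    """Feed a sequence of link-check results through one watchdog."""
    wd = LinkWatchdog()
    return [wd.on_check(ok) for ok in checks]
```

With four consecutive failed checks, `run([False] * 4)` gives three `cycle-interface` actions followed by one `log-and-reboot`; a successful check in between resets the count.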

I use Dude to monitor throughput, and there is no obvious throughput relationship to failures.
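One rough way to put a number on "no obvious relationship" is to compare the average throughput just before each failure with the overall average. The sketch below assumes the throughput samples and failure times have already been exported as plain timestamps (seconds) and Mbps values; the function name, the data layout and the 10 minute default window are all assumptions for illustration, not anything Dude provides in this form.

```python
from statistics import mean


def throughput_before_failures(samples, failures, window_s=600):
    """Compare overall throughput with throughput shortly before failures.

    samples  -- list of (timestamp_s, mbps) tuples (hypothetical export format)
    failures -- list of failure timestamps in seconds
    window_s -- how far before a failure a sample still counts as "near"

    Returns (baseline_mean, near_failure_mean); the second value is None
    when no sample falls inside any pre-failure window.
    """
    baseline = mean(v for _, v in samples)
    near = [
        v
        for t, v in samples
        if any(0 <= f - t <= window_s for f in failures)
    ]
    return baseline, (mean(near) if near else None)
```

If the two numbers are close across many failures, that supports the view that traffic load is not the trigger; a large gap would be worth a closer look. The function expects at least one sample.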

We are now convinced it is something peculiar to the site, not the device. Four brand new devices don’t all fail in turn in the same ways in the same spot without there being something suspect about the spot …

We are thinking weird things like gobs of RF into the antenna from somewhere else (invisible), or huge transients being induced into the POE bearing CAT5. All of which are a bit left field.

Steve.

From: Andrew Cox [mailto:andrew.cox@bigair.net.au] 
Sent: Tuesday, 29 April 2014 23:47
To: steve@digitronics.com.au; MikroTik Australia Public List
Subject: Re: [MT-AU Public] Problems with multiple RBMetal2SHPn devices failing at one site

Hi Steve,

- checked ram/resource graphs to see if it was perhaps hitting a memory leak and crashing

- tried disabling additional services that were not in use to make sure something isn't causing the crashes (remove l7 filtering anywhere, disable conntrack, stop polling via SNMP for a period of time, disable all but winbox/ssh services)

- (as an inverse to the previous) tried monitoring more information, voltage levels, cpu, interface errors, ambient temp

- enabled watchdog timer with a ping/reboot target

- added some netflow monitoring to report traffic through 1 or more of the units to catch any odd traffic around the time of the lockups

- added firewall log & filter rules to drop any non-critical input-chain traffic hitting the units

Just a couple of things I'd look to run through, depending on how easy they are to do in your current environment/setup.

- Andrew

On 29 April 2014 23:37, Steve at Digitronics <steve@digitronics.com.au> wrote:

We have lots of groove type devices out there, plastic and metal, but there is one installation where we are having consistent
problems, and it is the only place we have had any problems.

Over the last 6 months or so we have had kernel failures and script errors logged on four different devices at the same site, the
last three being RBMetal2SHPns. The four devices were installed at the same site one after another, each of the last three being a
replacement for the prior unit because it was playing up.

Typically, a device works ok for a while (up to a month) but then starts logging kernel faults and exhibiting other weird symptoms
such as script failures, and vanishing scripts. Sometimes only a reboot or a power cycle will get a failed unit going again.

The chance of there being actual faulty devices is now so vanishingly small as to be discounted, so we are trying to figure out what
it is at this site that could be causing the same persistent failures on the series of devices.

The device mounting and sealing has been checked. The antenna VSWR has been checked. The antenna cabling has been checked. The CAT5
cabling has been checked. The PSU and POE injector have both been changed at different times. The PSU is on a UPS. The site is not
subject to lightning strikes or subsequent voltage gradients. The device is only 15m from the POE injector. When it is working the
wireless data throughput is as expected. The device is on a private CAN so cannot be publicly hacked. The times and frequency of
failures do not follow any obvious throughput, temperature, humidity or time of day patterns. The unit at the other end of the link
has never had a failure of any kind.

We are struggling to think of any other possible site specific environmental or equipment influence(s) that could be causing these
failures, and I am really hoping that someone on the list can give us some fresh ideas or can share the resolution to similar
circumstances they have experienced.

TIA.

Steve.

_______________________________________________
Public mailing list
Public@talk.mikrotik.com.au
http://talk.mikrotik.com.au/mailman/listinfo/public_talk.mikrotik.com.au