Problems with multiple RBMetal2SHPn devices failing at one site

Steve at Digitronics

29 Apr 2014

1:37 p.m.

We have lots of groove type devices out there, plastic and metal, but there is one installation where we are having consistent problems, and it is the only place we have had any problems.

Over the last 6 months or so we have had kernel failures and script errors logged on four different devices at the same site, the last three being RBMetal2SHPns. The four devices have been installed at the same site, the last three as successive replacements for the prior unit because it was playing up.

Typically, a device works ok for a while (up to a month) but then starts logging kernel faults and exhibiting other weird symptoms such as script failures and vanishing scripts. Sometimes only a reboot or a power cycle will get a failed unit going again.

The chance of there being four actually faulty devices is now so vanishingly small as to be discounted, so we are trying to figure out what it is at this site that could be causing the same persistent failures on the series of devices.

The device mounting and sealing have been checked. The antenna VSWR has been checked. The antenna cabling has been checked. The CAT5 cabling has been checked. The PSU and POE injector have both been changed at different times. The PSU is on a UPS. The site is not subject to lightning strikes or subsequent voltage gradients. The device is only 15m from the POE injector. When it is working the wireless data throughput is as expected. The device is on a private CAN so cannot be publicly hacked. The times and frequency of failures do not follow any obvious throughput, temperature, humidity or time of day patterns. The unit at the other end of the link has never had a failure of any kind.

We are struggling to think of any other possible site-specific environmental or equipment influence(s) that could be causing these failures, and I am really hoping that someone on the list can give us some fresh ideas or can share the resolution to similar circumstances they have experienced.

TIA.

Steve.
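One way to double-check the statement that failures follow no time-of-day pattern is to bucket the logged failure times by hour and look at the gaps between them. The following is only a minimal sketch in Python, run off-device; the function names and the sample timestamps are made up for illustration and are not taken from the site logs.

```python
from collections import Counter
from datetime import datetime


def hour_histogram(timestamps):
    """Count failures per hour of day (0-23)."""
    return Counter(ts.hour for ts in timestamps)


def gaps_in_hours(timestamps):
    """Return the gaps between consecutive failures, in hours."""
    ordered = sorted(timestamps)
    return [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical failure times, for illustration only.
failures = [
    datetime(2014, 3, 2, 14, 5),
    datetime(2014, 3, 2, 14, 50),
    datetime(2014, 3, 9, 3, 20),
    datetime(2014, 4, 1, 14, 10),
]

by_hour = hour_histogram(failures)
gaps = gaps_in_hours(failures)
```

A cluster in one or two hours of the day would point at something scheduled nearby (machinery, a pump, a transmitter), while a flat spread supports the "no pattern" observation.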


Andrew Cox

29 Apr

1:47 p.m.


Hi Steve,

- checked ram/resource graphs to see if it was perhaps hitting a memory leak and crashing
- tried disabling additional services that were not in use to make sure something isn't causing the crashes (remove l7 filtering anywhere, disable conntrack, stop polling via SNMP for a period of time, disable all but winbox/ssh services)
- (as an inverse to the previous) tried monitoring more information, voltage levels, cpu, interface errors, ambient temp
- enabled watchdog timer with a ping/reboot target
- added some netflow monitoring to report traffic through 1 or more of the units to catch any odd traffic around the time of the lockups
- added firewall log & filter rules to drop any non-critical input-chain traffic hitting the units

Just a couple of things I'd look to run through, depending on how easy they are to do in your current environment/setup.

- Andrew


Steve at Digitronics

3:12 p.m.


Hello Andrew,

Thanks for the quick reply.

This unit is one end of a link that forms half of a double link, using four units in total. All four units are configured essentially identically, except for names, keys, IPs and single routes. None of the other three units has ever exhibited any problems at all over the life of the link(s).

Resources are all well within limits. CPU is idling. Device voltage and ambient temperatures are fine. Crashes can be days apart or less than an hour apart, regardless of traffic.

Links are running NV2. No firewall, NAT or filter settings. No services other than SNMP, Winbox and SSH.

I have a simple script running on the device at the problem location which checks the link integrity and cycles the wireless interface if it fails, and then reboots the device if three interface cycles don’t get the link back. I use a script because it also logs some stuff before it reboots. That said, we have had at least one instance of logging in to the device after an extended outage to find the script had vanished, but it came back after the reboot :)

I use Dude to monitor throughput, and there is no obvious throughput relationship to failures.

We are now convinced it is something peculiar to the site, not the device. Four brand new devices don’t all fail in turn in the same ways in the same spot without there being something suspect about the spot …

We are thinking weird things like gobs of RF into the antenna from somewhere else (invisible), or huge transients being induced into the POE bearing CAT5. All of which are a bit left field.

Steve.
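The escalation logic of the link-check script described above (check the link, cycle the wireless interface up to three times, log, then reboot) can be sketched roughly as follows. This is not Steve's actual RouterOS script, which was not posted; it is a Python illustration of the same control flow, and `link_is_up`, `cycle_interface`, `reboot` and `log` are hypothetical callables supplied by the caller.

```python
def supervise_link(link_is_up, cycle_interface, reboot, log, max_cycles=3):
    """Run one supervision pass and return what was done.

    All four callables are placeholders for device-specific actions;
    max_cycles=3 mirrors the three interface cycles mentioned in the thread.
    """
    if link_is_up():
        return "ok"

    for attempt in range(1, max_cycles + 1):
        log(f"link down, cycling wireless interface (attempt {attempt})")
        cycle_interface()
        if link_is_up():
            log("link restored")
            return "recovered"

    # Log before rebooting so the reason survives in the log history.
    log("link still down after cycling, rebooting")
    reboot()
    return "rebooted"
```

Because this logic lives in a script on the unit, it disappears if the script itself vanishes, which is one reason a hardware watchdog is often run alongside it.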

Paul Julian

10:49 p.m.


Hi Steve,

I have had similar problems when restoring backups of configs to replacement devices. If you restored a backup to the device initially, perhaps try a netinstall followed by a manual reconfigure?

Regards
Paul


Steve at Digitronics

30 Apr

1:47 a.m.


Hello Paul,

Thanks for the suggestion. We looked at that in detail for the third unit. Netinstall and manual reconfiguration made no difference to the behaviour.

It is hard to recall and list all the things we have tried over time, but your suggestion should have been included ...

Steve.


Mike Everest

29 Apr

10:45 p.m.



Hi Andrew, all,

The previous 2 failures display an apparent physical hardware problem (even netinstall does not recover them), and they have even been accepted for replacement by MikroTik (i.e. they are convinced it is a hardware problem ;) Which makes it quite a mystery indeed!

I have seen this /sort/ of thing happen a few times before, but the manifested behaviour is either a drop in tx/rx signals (radio damage) or a dead internal power supply, from causes such as:

- lightning damage
- electrical interference (e.g. nearby refrigeration motors)
- industrial/welding machinery

Usually, the effect can be resolved by proper grounding (via both case-to-pole and shielded Ethernet cable) and/or physical relocation of the mounting point by a few meters in any direction.

In this case, however, the problem is not the typical behaviour - maybe something different?

Cheers!

Mike.


Matt Chipman

10:21 p.m.


Hi Steve,

Vanishing scripts and kernel faults remind me of faulty disks on a server. In this case it would be memory card corruption, which the MT OS fixes on a reboot.

Since this is repetitive, have you considered magnetism close by?

-Matt
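If storage corruption is suspected, one way to pin down when a script vanishes is to keep periodic text exports of the configuration off the device (for example, the output of `/export` saved by whatever means suits the site) and compare successive copies. The sketch below assumes two such exports are already available as strings; the function names are illustrative only.

```python
import hashlib


def fingerprint(config_text):
    """Return a SHA-256 hex digest of one configuration export."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def vanished_lines(before, after):
    """Return lines present in the earlier export but absent from the later one."""
    after_lines = set(after.splitlines())
    return [line for line in before.splitlines() if line not in after_lines]
```

A changed fingerprint with no administrative change in between, plus the list of vanished lines, gives a timestamped record to set against the kernel fault entries in the log.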

