Loss of messages when cellular connection drops

Lancemt · February 18, 2019, 2:37am

Hi

I’m currently running a gateway with lora-gateway-bridge v2.7.0 on a Raspberry Pi Zero with a RAK833 and a USB 4G modem, publishing over MQTT with qos=1.

The gateway is in an area with poor cellular reception and the connection does drop out for up to a few hours.

Can the uplink messages received during that period be retained and published when the connection is restored?
Could you explain the intended behavior for an unreliable internet connection?

Many thanks

cstratton · February 22, 2019, 7:00pm

While there are ways you could do this, it is fundamentally a bit at odds with the idea of LoRaWAN.

It’s easy enough to retain raw LoRa packets on a gateway where you control the software. The problem is that in the LoRaWAN architecture gateways do not have the keys to decode application or even network-level traffic, and are both unable and unauthorized to autonomously reply to any confirmation-requested transmissions, not to mention anything having to do with adaptive data rate, over-the-air activation (re)joins, or housekeeping. Essentially gateways are all but stateless translators between MQTT and LoRa-modulated RF: they don’t actually “know” anything.

Even if your nodes do keep transmitting without any replies throughout the gap in backhaul service, a LoRaWAN server isn’t set up to ingest “stale” traffic, especially if it arrives out of order, since that would trip things like the anti-replay frame-counter checks. So if you have any other gateways that might have reported the occasional packet through on time, you would have to disable those checks. I’m not quite sure what would happen if you tried to feed in stale data with them disabled, though it is easy enough to try at the MQTT level. Now that I think about it, having had a gateway reboot and reconnect to LoRaServer faster than it got a new NTP time fix, it looks like packets with stale times may actually be accepted if the frames are still in order.
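To illustrate the frame-counter problem, here is a minimal Python sketch of a strict anti-replay check. It is not LoRaServer's actual implementation: the names `accept_uplink` and `replay`, and the simple "counter must be greater than the last accepted one" rule, are assumptions made for illustration.

```python
def accept_uplink(last_fcnt, fcnt):
    """Return True if the frame passes a strict anti-replay check.

    last_fcnt is the highest uplink counter accepted so far (None if no
    frame has been seen yet); fcnt is the counter of the incoming frame.
    """
    return last_fcnt is None or fcnt > last_fcnt


def replay(frames):
    """Feed frame counters in arrival order; return (accepted, rejected)."""
    last = None
    accepted, rejected = [], []
    for fcnt in frames:
        if accept_uplink(last, fcnt):
            accepted.append(fcnt)
            last = fcnt
        else:
            rejected.append(fcnt)
    return accepted, rejected


# Frames 10-12 were buffered on an offline gateway; frame 13 arrived on
# time through a second gateway before the buffered ones were uploaded.
arrival_order = [9, 13, 10, 11, 12]
accepted, rejected = replay(arrival_order)
```

With this arrival order the buffered frames 10 to 12 are all rejected, because frame 13 has already advanced the counter. If the buffered frames had arrived before frame 13, they would all have passed.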

Or you could decode the stale data from the gap periods yourself, in any of three ways: knowing the original secrets of ABP nodes, querying OTAA session secrets from the server API, or knowing the original secrets and watching the raw join traffic.
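As a first step towards decoding retained frames yourself, the unencrypted header of a data uplink can be read without any keys. The following is a rough Python sketch, assuming the LoRaWAN 1.0.x frame layout (MHDR, DevAddr, FCtrl, FCnt, FOpts, FPort, FRMPayload, MIC) and a base64-encoded PHYPayload as input. The function name `parse_uplink_header` is hypothetical. It stops short of decryption, which needs the session keys and AES.

```python
import base64
import struct


def parse_uplink_header(phy_payload_b64):
    """Extract the unencrypted header fields of a LoRaWAN data uplink."""
    raw = base64.b64decode(phy_payload_b64)
    # MHDR(1) + DevAddr(4) + FCtrl(1) + FCnt(2) + MIC(4) = 12 bytes minimum
    if len(raw) < 12:
        raise ValueError("too short for a data frame")
    mtype = raw[0] >> 5  # top three bits of MHDR
    if mtype not in (2, 4):  # 2 = unconfirmed data up, 4 = confirmed data up
        raise ValueError("not a data uplink (MType %d)" % mtype)
    # DevAddr and FCnt are little-endian on the air
    dev_addr, fctrl, fcnt = struct.unpack_from("<IBH", raw, 1)
    fopts_len = fctrl & 0x0F
    if 8 + fopts_len > len(raw) - 4:
        raise ValueError("FOpts length exceeds frame size")
    body = raw[8 + fopts_len:-4]  # FPort + FRMPayload, if present
    return {
        "confirmed": mtype == 4,
        "dev_addr": format(dev_addr, "08x"),
        "fcnt": fcnt,  # only the low 16 bits are carried in the frame
        "fport": body[0] if body else None,
        "frm_payload": body[1:] if body else b"",  # still encrypted
        "mic": raw[-4:],
    }
```

Sorting retained frames by `dev_addr` and `fcnt` would at least let you put them back in order before decrypting them with the AppSKey for each device.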

Also as a comment on “cellular” (etc) modems: you may want to invest more effort in managing them. If things get really stuck, a task which ultimately gives up and reboots your gateway’s computer does not necessarily reboot the modem. We ended up running a wire from a GPIO to be able to power cycle the modem (fortunately the SBC in use already had a USB power switch chip, but it was set up for overcurrent protection only, with the enable pin having only a pullup until we explicitly wired it to a GPIO). Some modems also have a reset pin which might work, but that was going to be harder to get at in our case.

Lancemt · February 25, 2019, 2:00am

Could you elaborate on this further?

This isn’t a problem for us as we use ABP with no ADR or confirmations.

Yes I would like to know what happens in this case.

This is dealt with via connection check scripts and a load switch. The connection loss is due to low signal level, so environmental factors are the main contributor to connection instability.

cstratton · February 25, 2019, 4:45am

A crude form of local retention would be extending the debug code in the packet forwarder to print out the packets, at which point they end up in syslog as debug messages that could be grepped out by some other task.

Likely a better step would be to push them to some sort of local data store, and create a scheme for uploading them later.
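One possible shape for such a local store is a small SQLite-backed spool, sketched below in Python. Everything here is an assumption for illustration: the class name `UplinkSpool`, the table layout, and the `publish` callback, which stands in for whatever actually sends a frame upstream and should return True only once delivery has succeeded.

```python
import sqlite3
import time


class UplinkSpool:
    """Store-and-forward queue for raw uplink frames, backed by SQLite."""

    def __init__(self, path=":memory:"):
        # Use a file path on the gateway so frames survive a reboot.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS spool ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "received_at REAL NOT NULL, "
            "payload BLOB NOT NULL)"
        )
        self.db.commit()

    def store(self, payload, received_at=None):
        """Append one frame (bytes) with its receive time (Unix seconds)."""
        ts = time.time() if received_at is None else received_at
        self.db.execute(
            "INSERT INTO spool (received_at, payload) VALUES (?, ?)",
            (ts, payload),
        )
        self.db.commit()

    def pending(self):
        """Return the number of frames still waiting to be uploaded."""
        return self.db.execute("SELECT COUNT(*) FROM spool").fetchone()[0]

    def flush(self, publish):
        """Upload frames oldest first; stop at the first failure.

        A frame is deleted only after publish() returns True, so a
        connection drop mid-flush leaves the remainder in the spool.
        """
        sent = 0
        rows = self.db.execute(
            "SELECT id, received_at, payload FROM spool ORDER BY id"
        ).fetchall()
        for row_id, received_at, payload in rows:
            if not publish(received_at, payload):
                break
            self.db.execute("DELETE FROM spool WHERE id = ?", (row_id,))
            self.db.commit()
            sent += 1
        return sent
```

Uploading oldest first keeps the frames in their original order, which matters for the frame-counter checks on the server side.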

In addition to retaining data received while offline, I’ve been interested in trying to make gateways transmit error messages via LoRa when their backhaul is lost, in the hope that another gateway might hear this and report it in. Since I couldn’t figure out a way to do either with the gateway bridge, I actually ended up throwing together my own crude version this afternoon with the idea that it might be a place to plug in such add-ons, and am running it a bit to try to get a sense of what the issues with that might be. Getting data flowing was pretty straightforward, mostly a matter of correcting typos in my JSON routines and then noticing that the time strings in the Semtech UDP stats messages mismatch everything else; but the real issues are probably not yet discovered… I have things set up so that if it fails, it will run lora-gateway-bridge on the next boot, which the watchdog should bring about soon-ish…
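On the time string mismatch: as far as I can tell from the examples in the Semtech UDP protocol description, stats messages carry times like "2014-01-12 08:59:28 GMT", while received-packet entries use the compact ISO 8601 form like "2013-03-31T16:21:17.528002Z". A tolerant parser might look like the Python sketch below. The function name `parse_gateway_time` is hypothetical, and the format list is an assumption based on those two examples.

```python
from datetime import datetime, timezone


def parse_gateway_time(text):
    """Parse either Semtech UDP time style into an aware UTC datetime."""
    formats = (
        "%Y-%m-%dT%H:%M:%S.%fZ",   # rxpk style with fractional seconds
        "%Y-%m-%dT%H:%M:%SZ",      # rxpk style without fractional seconds
        "%Y-%m-%d %H:%M:%S GMT",   # stats message style
    )
    for fmt in formats:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        # Both styles are UTC, so attach the timezone explicitly.
        return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("unrecognised gateway time string: %r" % text)
```

Normalising both styles to one representation early means the rest of the bridge only ever has to deal with a single time type.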