LoRa Packet forwarder stops sending packets if LGB is restarted

KarthikSubramanian2 · April 15, 2019, 8:46pm

I am currently facing the following issue:
When the LoRa Gateway Bridge (LGB) is restarted while a gateway is sending its stats packets, the stats packets no longer seem to be consumed until the packet forwarder is restarted as well.

Has anyone faced this issue before?

Conduit - Multitech
stats interval - 30 seconds
LGB - deployed in K8s

When it is working, I see this log in the packet forwarder:

JSON up: {"stat":{"time":"2019-04-15 20:28:17 GMT","rxnb":1,"rxok":0,"rxfw":0,"ackr":100.0,"dwnb":0,"txnb":0}}
INFO: [up] PUSH_ACK received in 12 ms
INFO: [down] PULL_ACK received in 2 ms

But when the gateway bridge is not receiving any packets, I only see the stats line, with no PUSH_ACK or PULL_ACK after it:

JSON up: {"stat":{"time":"2019-04-15 20:31:17 GMT","rxnb":0,"rxok":0,"rxfw":0,"ackr":0.0,"dwnb":0,"txnb":0}}
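For anyone who wants to spot this state from the packet forwarder log automatically, here is a minimal sketch. It assumes the log has exactly the line shapes pasted above ("JSON up: ..." for stats and "PUSH_ACK received" for acknowledgements); the function name is made up for illustration and the matching may need adjusting for other forwarder versions.

```python
def unacked_stats_reports(log_lines):
    """Count stats reports at the end of the log that no PUSH_ACK followed.

    Assumes the line formats shown in the post above; other packet
    forwarder builds may log differently.
    """
    pending = 0
    for line in log_lines:
        if line.startswith("JSON up:") and '"stat"' in line:
            # A stats report went out and is waiting for its PUSH_ACK.
            pending += 1
        elif "PUSH_ACK received" in line:
            # The bridge answered, so everything sent so far is acknowledged.
            pending = 0
    return pending
```

A value that keeps growing across stats intervals would match the broken state described here, while a healthy log should keep returning to 0.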

brocaar · April 16, 2019, 2:00pm

I have seen this before, and in that case the cause was a Docker networking issue.

KarthikSubramanian2 · April 16, 2019, 2:49pm

Yeah, when I just run the golang app, I don’t see this. It’s the container networking that seems to be the problem. I will post what I find out. Were you able to resolve it? Or is there a workaround you can suggest?

cstratton · April 16, 2019, 4:43pm

While it’s best to solve all known problems at their cause, for deployed gateways it’s good to have an overall watchdog, too.

Our gateways now have a task which does an independent MQTT subscription to their own stats messages. If those aren’t seen coming back from the broker (for whatever reason), eventually the gateway reboots. (It turns out, though, that rebooting was not always enough: I also had to add a wire to restart the LTE modems, as those could stay in a bad state across computer reboots. Additionally, Linux systems can sometimes get into a state where they won’t reboot on command, so what our daemon actually does is ask for an orderly reboot and also stop feeding a low-level watchdog which it configures on startup.)

While I did that for a gateway design that runs the bridge locally, it should be possible even for one that doesn’t if you can get some sort of MQTT client to run on it. You could also potentially use something else as your roundtrip confirmation; I specifically chose not to use any part of the legacy protocol for that, as in the case of the local bridge, the UDP can have acks even when the bridge isn’t getting through to the broker.
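To make the idea concrete, here is a minimal sketch of just the timing part of such a watchdog. It is not our actual daemon: the class and method names are hypothetical, the MQTT subscription and the reboot action are deliberately left out, and the timeout you pick is an assumption that should be several stats intervals long.

```python
import time


class StatsWatchdog:
    """Tracks how long it has been since our own stats came back.

    Hypothetical helper: the MQTT client and the reboot logic live elsewhere.
    """

    def __init__(self, timeout_s, now=time.monotonic):
        self._timeout_s = timeout_s
        self._now = now  # injectable clock, so the logic can be tested
        self._last_seen = now()

    def stats_seen(self):
        # Call this whenever a stats message for this gateway arrives
        # from the broker.
        self._last_seen = self._now()

    def expired(self):
        # True once no stats have come back for longer than the timeout.
        return self._now() - self._last_seen > self._timeout_s
```

The intent is to call `stats_seen()` from whatever MQTT client's message callback you use, poll `expired()` from the main loop, and only request the reboot (and stop feeding the hardware watchdog) once it returns true.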

KarthikSubramanian2 · April 17, 2019, 4:19pm

Thanks for your message. That’s what we are doing temporarily: a monitoring service on the gateway restarts the packet forwarder when this issue occurs, with the help of the autoquit_threshold setting.

I did try killing the plain Docker containers and restarting them, and that seemed okay. It’s when the image is deployed to Kubernetes that the issue happens.

I am going to take a deeper look at the K8s networking for UDP, and probably run a tcpdump to see if I can infer something from that.

I will post what i find out.

KarthikSubramanian2 · April 18, 2019, 8:49pm

We did manage to find out what was happening, at least in my case. Posting it here in case it helps someone else.

We deploy the LoRa Gateway Bridge in K8s, and we found an issue where the conntrack entries were not getting updated after the restart. The packet forwarder kept hitting the stale entries, so its packets never made it to the new pod.

When the packet forwarder is restarted, the conntrack entry gets updated, which makes it work again.

Here is an open issue about it - https://github.com/kubernetes/kubernetes/issues/66651
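If restarting the packet forwarder is not an option, one thing that might be worth trying is deleting the stale UDP conntrack entries on the affected node after the bridge restarts. The sketch below is untested in this setup and rests on several assumptions: the `conntrack` tool from conntrack-tools is installed on the node, the script runs as root, and the port you pass is the UDP port the packet forwarder actually sends to (1700 is the usual Semtech default, but a NodePort or other mapping may differ). The function names are made up for illustration.

```python
import subprocess


def conntrack_delete_cmd(udp_port):
    """Build the conntrack command that deletes UDP entries for one port."""
    # -D deletes matching entries, -p selects the protocol,
    # --dport matches the original destination port.
    return ["conntrack", "-D", "-p", "udp", "--dport", str(udp_port)]


def flush_stale_udp_entries(udp_port, run=subprocess.run):
    """Run the delete command on this node (needs root and conntrack-tools)."""
    # check=False: a non-zero exit (for example when nothing matched)
    # should not raise here.
    return run(conntrack_delete_cmd(udp_port), check=False)
```

Calling `flush_stale_udp_entries(1700)` on the node right after the bridge pod comes back would be the experiment to run; whether it fully clears the condition described in the linked issue is something to verify with tcpdump.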