Monday, 15 February 2016

MikroTik Automatic Failover Two Gateways


There are a million ways to do this on the wiki and the web, but none of them fit my particular application.  Let me explain:
1.  The weak point in my network is an AirFiber 24 link upstream from the tower I connect to wirelessly.  That link goes down in heavy rain, cutting our office off from PROVIDER1.  We have a backup connection through a second provider, PROVIDER2, which is slower but, being 5GHz, doesn’t drop in the rain.
The network is like this:
[MikroTik CCR1036-12G-4S]
—[RBSXT]—[RBOmniTikU-5HnD]—[AF24]—[PROVIDER1]
—[RBSXT]—[PROVIDER2]
2. Simple floating static routes with check-gateway don’t help because on PROVIDER1 we never lose our 5GHz connection to the tower; it’s the upstream link that fails, so the gateway itself stays reachable.
3. I tried recursive routes, and they worked, but failover was sporadic at best.
4. When failover did occur, the VOIP PBX would hold its connection open through the dead provider and some phones in the office wouldn’t work at all; rebooting the phone was the only fix. We tried a ton of solutions and never got it to work consistently.
The solution that works the best is as follows.  I am using a combination of static routes, firewall rules and Netwatch scripts. Here it is:
The Netwatch script watches 4.2.2.4 (a public DNS server). If it goes down:
  • It changes the distance on the default route through PROVIDER1 to 20, making it inactive.  Now all traffic defaults through PROVIDER2.
  • It emails me that the gateway has changed. Please note you must set up your email server IP and any authentication in /tool e-mail first.
  • It clears any connections to my VOIP gateway, thereby causing them to re-establish. Interestingly, calls do not drop!
  • When pings return, it sets the distance on the default route through PROVIDER1 back to 1, making it the active route again, and then clears all connections to the VOIP gateway again.
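For readability, the down branch described above looks roughly like this once the export escaping is removed. This is only an unescaped view of the same commands, using the same placeholders (YourEmailAddress, YourVoipGatewayIP); the up branch mirrors it with distance=1 and the "Regained" message. The export form that follows is the one to paste into the router.

```routeros
# Runs when 4.2.2.4 stops answering pings.
# Demote the PROVIDER1 default route so the PROVIDER2 route (distance 10) wins.
/ip route set [find comment="PROVIDER1"] distance=20
# Make sure the backup route is enabled.
/ip route set [find comment="PROVIDER2"] disabled=no
# Notify the admin (requires /tool e-mail to be configured).
/tool e-mail send to="YourEmailAddress" subject="Lost connection with PROVIDER1" body="Connection with PROVIDER1 Lost, Switched to PROVIDER2"
# Flush tracked connections to the VOIP gateway so they re-establish via the new route.
/ip firewall connection remove [find dst-address="YourVoipGatewayIP"]
```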
/tool netwatch
add comment=CheckCon down-script="/ip route set [find comment=\"\
PROVIDER1\"] distance=20\r\
\n/ip route set [find comment=\"PROVIDER2\"] disabled=no\r\
\n/tool e-mail send to=\"YourEmailAddress\" body=\
\"Connection with PROVIDER1 Lost, Switched to PROVIDER2\" \
subject=\
\"Lost connection with PROVIDER1\"\r\
\n/ip firewall connection remove [find dst-address=\"\
YourVoipGatewayIP\"]" host=4.2.2.4 interval=5s timeout=2s \
up-script="/ip route set [find comment=\"PROVIDER1\"] distance=1\r\
\n/ip route set [find comment=\"PROVIDER2\"] disabled=no\r\
\n/tool e-mail send to=\"YourEmailAddress\" body=\
\"Connection with PROVIDER1 Regained, Switched back to PROVIDER1\" \
subject=\"Regained connection with PROVIDER1\"\r\
\n/ip firewall connection remove [find dst-address=\"\
YourVoipGatewayIP\"]"
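The scripts above rely on /tool e-mail already being configured. A minimal sketch of that setup follows; the server address, port, sender and credentials are all made-up placeholders, and the property names assume RouterOS v6 (on v7 the server is set with server= rather than address=, as far as I know).

```routeros
# Hypothetical SMTP relay and credentials; substitute your own.
/tool e-mail set address=192.0.2.25 port=25 from="router@example.com" \
    user="router" password="changeme"
# Send a test message before relying on the Netwatch notifications.
/tool e-mail send to="YourEmailAddress" subject="Router e-mail test" \
    body="Test message from the router"
```

If the test message doesn't arrive, the Netwatch e-mails won't either, so it is worth checking this before the next rainstorm.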
Next we need to ensure we can only ping our test host through the PROVIDER1 connection.  This is done with a static route through PROVIDER1:
/ip route add \
comment="Force test pings through PROVIDER1" dst-address=4.2.2.4 \
gateway=199.21.228.153
Next we need to add comments to our default routes, since the Netwatch scripts find them by comment.
/ip route
add comment=PROVIDER1 distance=1 gateway=199.21.228.137 scope=11
add comment=PROVIDER2 distance=10 gateway=209.112.225.65
Next we need to ensure that pings to our test IP go through PROVIDER1 only, by dropping any that would leave through the PROVIDER2 interface:
/ip firewall filter add chain=output comment=\
"Drop pings to 4.2.2.4 if they go through PROVIDER2" \
dst-address=4.2.2.4 out-interface=ether2 action=drop
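With everything in place, a few read-only commands can help confirm the pieces are behaving as described. This is just a checklist sketch, not part of the original setup; YourVoipGatewayIP is the same placeholder used in the Netwatch scripts.

```routeros
# Netwatch status for the 4.2.2.4 probe (up/down and time of last change).
/tool netwatch print
# Both default routes; the active one carries the A flag.
/ip route print where dst-address=0.0.0.0/0
# The test host should answer via PROVIDER1.
/ping 4.2.2.4 count=3
# Packet counters on the drop rule should only grow while PROVIDER1 is down.
/ip firewall filter print stats
# Tracked connections toward the VOIP gateway (regex match on the address).
/ip firewall connection print where dst-address~"YourVoipGatewayIP"
```

If the last command lists entries with a port appended to the address, it is worth confirming that the exact-match find in the Netwatch scripts actually selects them on your RouterOS version; a regex match like the one above is one possible alternative.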
As I write this it is pouring rain outside, and I have watched the PROVIDER1 link go down 3-4 times. Even with people on the phone, calls continue and we haven’t lost the network. I am loving this!