Bulletproof pfSense Multi-WAN Failover: Stop Chasing Zombie ISP Links

Learn how to eliminate zombie ISP links by configuring multi-WAN failover, Policy Routing, and Gateway Groups in pfSense. Includes deep-dives into dpinger, pf.conf, and RMM automation strategies.


The Ticket: Multi-WAN Failover Implementation

The client is reporting "the internet is down," but your RMM shows the endpoint is still checking in via the local agent. Upon investigation, the primary fiber circuit is experiencing severe packet loss (a brown-out), yet the gateway interface remains "Up" because the ISP's modem is still powered on. The on-site tech plugged in a Cradlepoint LTE backup, but traffic isn't routing. We need to transition from a single-homed setup to a redundant Gateway Group with automated path selection to stop these garbage after-hours alerts.


Pre-Flight Check

  • Permissions: Full Administrative access to the pfSense WebGUI or root via SSH.
  • Tools: Secondary ISP handoff, Ethernet cabling, and external monitoring IPs (e.g., 8.8.8.8 or 1.1.1.1).
  • Impact: Moderate - Cutting LAN traffic over to policy routing can drop existing active sessions (VoIP calls, SSH sessions, RDP), especially if states are reset during the change. Schedule a maintenance window.

The Solution

1. Interface Assignment & Physical Layer

  • Navigate to Interfaces > Interface Assignments.
  • Select the available NIC port (e.g., igc1 or ix0) from the "Available network ports" dropdown. Click Add.
  • Click the new interface link (likely OPT1).
  • Enable: Check "Enable Interface."
  • Description: WAN_BACKUP.
  • IPv4 Configuration: Select DHCP (or Static IPv4 if provided by ISP).
  • Security: Check Block private networks and Block bogon networks.
  • Save and Apply Changes.

2. Define Gateway Monitoring

  • Navigate to System > Routing > Gateways.
  • Edit the Primary WAN Gateway. Set Monitor IP to 8.8.8.8.
  • Edit the WAN_BACKUP Gateway. Set Monitor IP to 1.1.1.1.
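  • Note: Each gateway must have its own unique monitor IP. pfSense installs a static host route for every monitor IP through the gateway it belongs to, so reusing one address on two gateways produces conflicting routes and unreliable monitoring.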
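
The uniqueness rule in the note above is easy to audit before you touch the GUI. A minimal sketch, assuming you keep a simple gateway-to-monitor-IP mapping somewhere in your documentation or tooling (the mapping and the function name are illustrative, not part of pfSense):

```python
from collections import Counter


def duplicate_monitor_ips(gateways: dict[str, str]) -> list[str]:
    """Return monitor IPs that are assigned to more than one gateway."""
    counts = Counter(gateways.values())
    return sorted(ip for ip, n in counts.items() if n > 1)


# Hypothetical mapping that mirrors the two gateways configured above.
gateways = {"WAN_DHCP": "8.8.8.8", "WAN_BACKUP_DHCP": "1.1.1.1"}

duplicates = duplicate_monitor_ips(gateways)
if duplicates:
    print("Shared monitor IPs, fix before deploying:", duplicates)
else:
    print("Monitor IPs are unique")
```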

3. Construct the Gateway Group

  • Navigate to System > Routing > Gateway Groups. Click Add.
  • Group Name: FAILOVER_GROUP.
  • WAN_DHCP (Primary): Tier 1.
  • WAN_BACKUP_DHCP: Tier 2.
  • Trigger Level: Packet Loss or High Latency.
  • Save.
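
To make the tier logic concrete, here is a simplified sketch of how a Gateway Group picks its active members: the lowest-numbered tier with at least one healthy gateway wins. This is a model of the behavior, not pfSense's actual code, and the function name and data shapes are assumptions.

```python
def active_gateways(tiers: dict[str, int], online: dict[str, bool]) -> list[str]:
    """Return the healthy members of the lowest-numbered tier that has any."""
    for tier in sorted(set(tiers.values())):
        members = sorted(
            gw for gw, t in tiers.items() if t == tier and online.get(gw, False)
        )
        if members:
            return members
    return []  # every tier is down


# Tier assignments from the FAILOVER_GROUP steps above.
tiers = {"WAN_DHCP": 1, "WAN_BACKUP_DHCP": 2}

print(active_gateways(tiers, {"WAN_DHCP": True, "WAN_BACKUP_DHCP": True}))
print(active_gateways(tiers, {"WAN_DHCP": False, "WAN_BACKUP_DHCP": True}))
```

With both gateways healthy only the Tier 1 member carries traffic. Putting both gateways on the same tier would return both, which is the load-balancing case.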

4. Policy Routing Implementation

  • Navigate to Firewall > Rules > LAN (and any other relevant VLANs).
  • Edit the Default allow LAN to any rule.
  • Scroll to Advanced Options (click Display Advanced if the section is collapsed) and find the Gateway field.
  • Select FAILOVER_GROUP from the dropdown.
  • Save and Apply.

5. Global DNS Configuration

  • Navigate to System > General Setup.
  • DNS Server 1: 8.8.8.8 | Gateway: WAN_DHCP.
  • DNS Server 2: 1.1.1.1 | Gateway: WAN_BACKUP_DHCP.
  • Uncheck Allow DNS server list to be overridden by DHCP/PPP on WAN.
  • Save.

The "Why" (Root Cause)

By default, the FreeBSD OS underlying pfSense relies on the system routing table (netstat -rn) and forwards packets to the single default gateway. When a primary ISP experiences a mid-line fiber cut but the local modem stays powered on, the physical NIC link remains active. This is the classic "zombie link." The OS does not know the internet is dead; it only knows the modem's MAC address is still answering ARP requests. Out of the box, gateway monitoring pings the gateway address itself, which is exactly this useless next hop, so the gateway keeps reporting healthy. Because neither the link state nor the monitor ever changes to DOWN, pfSense never triggers a route rebuild.

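To fix this, we implement Policy Routing. For traffic that matches the rule, this bypasses the system routing table and forces the packet filter (pf) to send it through a dynamic Gateway Group. Traffic that originates on the firewall itself (for example the DNS resolver) still follows the system routing table, which is why the DNS step above matters. The system actively probes an external IP for each gateway. Once packet loss or latency crosses our threshold, it marks the tier dead, reloads the ruleset, and pf sends new connections out the secondary interface.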


Under the Hood (Technical Deep Dive)

When you enable gateway monitoring in pfSense, you are spinning up one dpinger instance per monitored gateway. This daemon is the core of the operation. pfSense launches each instance with command-line arguments that carry the latency and packet loss thresholds, so if you SSH in and run ps auxww | grep dpinger you can see the exact values in effect (the PID and socket files live in /var/run/). It sends ICMP echo requests to your configured Monitor IP (every 500ms by default). When a threshold is breached, it fires the rc.gateway_alarm script, which updates the gateway status, reloads the filter so rules that use the Gateway Group point at the next tier, and, depending on your state-killing settings, flushes states tied to the failed gateway.
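
The threshold check can be sketched in a few lines. This is a deliberately simplified model, assuming a fixed window of probe results and a plain average; dpinger's real time-window averaging is more involved. The default values mirror pfSense's stock thresholds (200/500 ms latency, 10/20 % loss), and the function name is hypothetical.

```python
from typing import Optional, Sequence


def classify(
    samples: Sequence[Optional[float]],
    latency_low: float = 200.0,
    latency_high: float = 500.0,
    loss_low: float = 10.0,
    loss_high: float = 20.0,
) -> str:
    """Classify a window of probe results.

    Each sample is a round-trip time in milliseconds, or None for a lost probe.
    """
    if not samples:
        return "unknown"
    replies = [s for s in samples if s is not None]
    loss_pct = 100.0 * (len(samples) - len(replies)) / len(samples)
    # With no replies at all, treat latency as infinite.
    avg_ms = sum(replies) / len(replies) if replies else float("inf")
    if loss_pct > loss_high or avg_ms > latency_high:
        return "down"
    if loss_pct > loss_low or avg_ms > latency_low:
        return "warning"
    return "online"


# A brown-out: the modem is up, but 3 of 10 probes to the monitor IP are lost.
print(classify([20.0] * 7 + [None] * 3))
```

The point of the sketch is that the verdict depends only on probe results to the external monitor IP, never on the NIC link state.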

What happens to the firewall rules? Policy routing changes the ruleset that pfSense generates and loads into pf. That ruleset is written to /tmp/rules.debug; pfSense does not use a hand-maintained /etc/pf.conf. When you set a gateway inside a rule via the GUI, it injects a route-to directive into the generated rule. For example: pass in quick on igc0 route-to (igc1 192.168.1.1) inet from any to any. This intercepts the packet before it ever reaches the standard FreeBSD Routing Information Base (RIB) and forces it out the designated interface.
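
If you want to confirm which rules are actually policy-routed, you can scan the generated ruleset for route-to targets. A minimal sketch, assuming you have already copied the rule text off the firewall; the regex and the function name are illustrative and only cover the simple single-gateway form of route-to.

```python
import re

# Matches "route-to (iface gateway)" with flexible spacing inside the parentheses.
ROUTE_TO = re.compile(r"route-to\s*\(\s*(\S+)\s+(\S+?)\s*\)")


def route_to_targets(rule: str) -> list[tuple[str, str]]:
    """Return (interface, gateway) pairs found in a single pf rule line."""
    return ROUTE_TO.findall(rule)


rule = "pass in quick on igc0 route-to (igc1 192.168.1.1) inet from any to any"
for iface, gateway in route_to_targets(rule):
    print(f"policy-routed via {gateway} on {iface}")
```

A rule with no route-to returns an empty list, which means that traffic still follows the system routing table.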

Then there is the state table (pfctl -s state). When we fail over, the source IP of the outbound packet changes. If you do not enable State Killing on Gateway Failure (the "flush all states when a gateway goes down" behavior; the exact label varies by release) in System > Advanced > Miscellaneous, existing TCP/UDP states will stubbornly try to route out the dead WAN. The firewall drops those packets because the path is dead, but the state remains open until it times out.

Conversely, if you are Load Balancing (putting both WANs on Tier 1 instead of failover), you absolutely must enable Sticky Connections. Otherwise, pf will round-robin outbound connections across both WANs, so the same client keeps appearing from different source IPs. Secure web servers and banking portals will see this IP jump, invalidate the session cookie, and boot the user out.
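
The difference between plain round-robin and sticky connections can be shown with a toy model. This sketch only illustrates the idea; in pf the sticky mapping is a source-tracking entry that expires once the client's states are gone, which is not modeled here. The class names are hypothetical.

```python
from itertools import cycle


class RoundRobin:
    """Each new connection takes the next gateway, regardless of the client."""

    def __init__(self, gateways):
        self._next = cycle(gateways)

    def pick(self, src_ip: str) -> str:
        return next(self._next)


class Sticky:
    """The first connection from a client pins that client to one gateway."""

    def __init__(self, gateways):
        self._next = cycle(gateways)
        self._table = {}

    def pick(self, src_ip: str) -> str:
        if src_ip not in self._table:
            self._table[src_ip] = next(self._next)
        return self._table[src_ip]


wans = ["WAN_DHCP", "WAN_BACKUP_DHCP"]
rr, sticky = RoundRobin(wans), Sticky(wans)

# Three connections from the same LAN client.
print([rr.pick("10.0.0.5") for _ in range(3)])
print([sticky.pick("10.0.0.5") for _ in range(3)])
```

The round-robin list alternates between the two WANs, which is the source IP jump that breaks banking sessions; the sticky list stays on one WAN.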


RMM & Automation Tips

Do not monitor this manually. Leverage your RMM’s shell script capabilities to audit gateway status across your fleet. Set up a custom RMM script that executes the pfSense developer shell command via SSH: pfSsh.php playback gatewaystatus. Parse the output for your backup WAN. If the backup link is active for more than 4 hours, trigger a "High" priority ticket in your PSA (ConnectWise/Autotask). The client is likely racking up massive overage charges on an LTE connection, and you need to intervene.
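
A minimal sketch of the parsing and the 4-hour rule follows. The column layout in SAMPLE is an assumption about what the gatewaystatus playback prints; check the real output on your pfSense version and adjust the column names. Fetching the text over SSH and creating the PSA ticket are left out, and all function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# ASSUMED output layout; verify against your own firewall before relying on it.
SAMPLE = """\
Name             Monitor  Source        Delay   StdDev  Loss  Status  Substatus
WAN_DHCP         8.8.8.8  203.0.113.10  0.0ms   0.0ms   100%  down    highloss
WAN_BACKUP_DHCP  1.1.1.1  192.0.2.20    85.1ms  30.2ms  0.0%  online  none
"""


def parse_gateway_status(text: str) -> list[dict[str, str]]:
    """Turn whitespace-separated status output into one dict per gateway."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def primary_is_down(rows: list[dict[str, str]], primary: str) -> bool:
    """True when the primary gateway is missing or not reported as online."""
    for row in rows:
        if row.get("Name") == primary:
            return row.get("Status") != "online"
    return True


def should_open_ticket(
    failover_since: Optional[datetime],
    now: datetime,
    limit: timedelta = timedelta(hours=4),
) -> bool:
    """True once the site has been running on the backup link longer than limit."""
    return failover_since is not None and now - failover_since > limit


rows = parse_gateway_status(SAMPLE)
print("primary down:", primary_is_down(rows, "WAN_DHCP"))
```

Your RMM script would need to persist the time it first saw the primary down (failover_since) between runs, then call should_open_ticket on each check.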

If you are using configuration management tools like Ansible to deploy config.xml files, use variables for your monitoring IPs, like [MonitorIP_Primary] and [MonitorIP_Backup]. Hardcoding 8.8.8.8 across 100 firewalls sitting on the same regional ISP node is a great way to get your probes rate-limited by ICMP protection on Google's side, causing false-positive failovers across your entire client base.
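
One way to spread the load is to derive each firewall's monitor IPs from its hostname, so the assignment is stable across runs but varies across the fleet. A sketch under stated assumptions: the pool below is just a list of well-known public resolvers chosen for illustration, and the function name is hypothetical. Feed the result into whatever variable mechanism your tooling uses.

```python
import hashlib

# Illustrative pool of public anycast resolvers; substitute targets you trust.
POOL = ["8.8.8.8", "8.8.4.4", "1.1.1.1", "1.0.0.1", "9.9.9.9", "208.67.222.222"]


def monitor_ips(hostname: str, pool: list[str] = POOL) -> dict[str, str]:
    """Pick two distinct monitor IPs deterministically from the hostname."""
    if len(pool) < 2:
        raise ValueError("need at least two monitor IPs in the pool")
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    i = digest[0] % len(pool)
    # An offset in 1..len(pool)-1 guarantees the backup differs from the primary.
    j = (i + 1 + digest[1] % (len(pool) - 1)) % len(pool)
    return {"MonitorIP_Primary": pool[i], "MonitorIP_Backup": pool[j]}


for host in ("fw-client-a", "fw-client-b", "fw-client-c"):
    print(host, monitor_ips(host))
```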


Troubleshooting & Edge Cases

  • Edge Case 1: DNS Resolution Blackholing: If the firewall is set to "Use local DNS" and the DNS resolver (unbound) only sends its queries out the Primary WAN interface, DNS lookups will fail even if the backup link is active. Fix this by letting unbound use the backup WAN for outgoing queries as well, and by explicitly assigning DNS servers to specific gateways in System > General Setup.
  • Edge Case 2: LTE High Jitter / Flapping: Cellular or satellite links inherently have high jitter. dpinger will often mark them as "Down" prematurely, causing routing table thrashing. Fix this by editing the Gateway, clicking Display Advanced, and raising the Latency Thresholds (e.g., from the default 200ms/500ms to 500ms/1000ms) and, if needed, the Packet Loss Thresholds (percentages, 10/20 by default) to give the flaky connection more breathing room.
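
To see why raising the high latency threshold calms a jittery link, count how often a series of latency readings crosses it. The readings below are made up for illustration, and the function is a toy model rather than dpinger's logic.

```python
def count_flaps(latencies_ms: list[float], latency_high_ms: float) -> int:
    """Count up/down transitions when 'down' means latency above the threshold."""
    states = [lat > latency_high_ms for lat in latencies_ms]
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Hypothetical averaged latency readings (ms) from a jittery LTE link.
lte = [180, 620, 240, 710, 300, 560, 220, 900, 260]

print(count_flaps(lte, 500))   # 8
print(count_flaps(lte, 1000))  # 0
```

The same readings that cause eight transitions at a 500ms threshold cause none at 1000ms, at the cost of reacting more slowly to a link that is genuinely degrading.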