algo/tests/unit/test_generated_configs.py
Dan Guido f668af22d0
Fix VPN routing on multi-homed systems by specifying output interface (#14826)
* Fix VPN routing by adding output interface to NAT rules

The NAT rules were missing the output interface specification (-o eth0),
which caused routing failures on multi-homed systems (servers with multiple
network interfaces). Without specifying the output interface, packets might
not be NAT'd correctly.

Changes:
- Added -o {{ ansible_default_ipv4['interface'] }} to all NAT rules
- Updated both IPv4 and IPv6 templates
- Updated tests to verify output interface is present
- Added ansible_default_ipv4/ipv6 to test fixtures

This fixes the issue where VPN clients could connect but not route traffic
to the internet on servers with multiple network interfaces (like DigitalOcean
droplets with private networking enabled).

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix VPN routing by adding output interface to NAT rules

On multi-homed systems (servers with multiple network interfaces or multiple IPs
on one interface), MASQUERADE rules need to specify which interface to use for
NAT. Without the output interface specification, packets may not be routed correctly.

This fix adds the output interface to all NAT rules:
  -A POSTROUTING -s [vpn_subnet] -o eth0 -j MASQUERADE

Changes:
- Modified roles/common/templates/rules.v4.j2 to include output interface
- Modified roles/common/templates/rules.v6.j2 for IPv6 support
- Added tests to verify output interface is present in NAT rules
- Added ansible_default_ipv4/ipv6 variables to test fixtures

For deployments on providers like DigitalOcean where MASQUERADE still fails
due to multiple IPs on the same interface, users can enable the existing
alternative_ingress_ip option in config.cfg to use explicit SNAT.

Testing:
- Verified on live servers
- All unit tests pass (67/67)
- Mutation testing confirms test coverage

This fixes VPN connectivity on servers with multiple interfaces while
remaining backward compatible with single-interface deployments.
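
A minimal sketch of the kind of check described above, assuming the rules are
in iptables-save format. The helper name `nat_rules_missing_output_interface`
and the sample subnets are illustrative assumptions, not code from Algo's
test suite.

```python
import re


def nat_rules_missing_output_interface(rules_text):
    """Return POSTROUTING source-NAT rules that lack an -o <iface> match."""
    missing = []
    for line in rules_text.splitlines():
        line = line.strip()
        # Only POSTROUTING rules that perform source NAT are of interest.
        if not line.startswith("-A POSTROUTING"):
            continue
        if not re.search(r"-j\s+(MASQUERADE|SNAT)\b", line):
            continue
        # A rule is flagged when no "-o <interface>" match is present.
        if not re.search(r"(?:^|\s)-o\s+\S+", line):
            missing.append(line)
    return missing


# Hypothetical sample: the first rule pins the output interface, the second does not.
sample_nat_rules = """*nat
:POSTROUTING ACCEPT [0:0]
-A POSTROUTING -s 10.19.49.0/24 -o eth0 -j MASQUERADE
-A POSTROUTING -s 10.19.50.0/24 -j MASQUERADE
COMMIT
"""

flagged = nat_rules_missing_output_interface(sample_nat_rules)
print(flagged)
```

Only the second sample rule is reported, since it has no `-o` match.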

* Fix dnscrypt-proxy not listening on VPN service IPs

Problem: dnscrypt-proxy on Ubuntu uses systemd socket activation by default,
which overrides the configured listen_addresses in dnscrypt-proxy.toml.
The socket only listens on 127.0.2.1:53, preventing VPN clients from
resolving DNS queries through the configured service IPs.

Solution: Disable and mask the dnscrypt-proxy.socket unit to allow
dnscrypt-proxy to bind directly to the VPN service IPs specified in
its configuration file.

This fixes DNS resolution for VPN clients on Ubuntu 20.04+ systems.

* Apply Python linting and formatting

- Run ruff check --fix to fix linting issues
- Run ruff format to ensure consistent formatting
- All tests still pass after formatting changes

* Restrict DNS access to VPN clients only

Security fix: The firewall rule for DNS was accepting traffic from any
source (0.0.0.0/0) to the local DNS resolver. While the service IP is
on the loopback interface (which normally isn't routable externally),
this could be a security risk if misconfigured.

Changed firewall rules to only accept DNS traffic from VPN subnets:
- INPUT rule now includes -s {{ subnets }} to restrict source IPs
- Applied to both IPv4 and IPv6 rules
- Added test to verify DNS is properly restricted

This ensures the DNS resolver is only accessible to connected VPN
clients, not the entire internet.
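
A rough sketch of how such a restriction could be checked, assuming
iptables-save style rules. The helper name `unrestricted_dns_rules`, the
service IP 172.16.0.1, and the subnet are placeholders chosen for
illustration and are not taken from Algo's templates or tests.

```python
import re


def unrestricted_dns_rules(rules_text):
    """Return INPUT rules that accept port 53 without a -s source match."""
    offenders = []
    for line in rules_text.splitlines():
        line = line.strip()
        if not line.startswith("-A INPUT") or "-j ACCEPT" not in line:
            continue
        # Only rules that target the DNS port are relevant.
        if not re.search(r"--dport\s+53\b", line):
            continue
        # Without a source match the rule accepts DNS from any address.
        if not re.search(r"(?:^|\s)-s\s+\S+", line):
            offenders.append(line)
    return offenders


# Hypothetical sample: the UDP rule is limited to the VPN subnet, the TCP rule is not.
sample_dns_rules = """-A INPUT -s 10.19.49.0/24 -d 172.16.0.1/32 -p udp -m udp --dport 53 -j ACCEPT
-A INPUT -d 172.16.0.1/32 -p tcp -m tcp --dport 53 -j ACCEPT
"""

open_rules = unrestricted_dns_rules(sample_dns_rules)
print(open_rules)
```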

* Fix dnscrypt-proxy service startup with masked socket

Problem: dnscrypt-proxy.service has a dependency on dnscrypt-proxy.socket
through the TriggeredBy directive. When we mask the socket before starting
the service, systemd fails with "Unit dnscrypt-proxy.socket is masked."

Solution:
1. Override the service to remove socket dependency (TriggeredBy=)
2. Reload systemd daemon immediately after override changes
3. Start the service (which now doesn't require the socket)
4. Only then disable and mask the socket

This ensures dnscrypt-proxy can bind directly to the configured IPs
without socket activation, while preventing the socket from being
re-enabled by package updates.

Changes:
- Added TriggeredBy= override to remove socket dependency
- Added explicit daemon reload after service overrides
- Moved socket masking to after service start in main.yml
- Fixed YAML formatting issues

Testing: Deployment now succeeds with dnscrypt-proxy binding to VPN IPs

* Fix dnscrypt-proxy by not masking the socket

Problem: Masking dnscrypt-proxy.socket prevents the service from starting
because the service has Requires=dnscrypt-proxy.socket dependency.

Solution: Simply stop and disable the socket without masking it. This
prevents socket activation while allowing the service to start and bind
directly to the configured IPs.

Changes:
- Removed socket masking (just disable it)
- Moved socket disabling before service start
- Removed invalid systemd directives from override

Testing: Confirmed dnscrypt-proxy now listens on VPN service IPs

* Use systemd socket activation properly for dnscrypt-proxy

Instead of fighting systemd socket activation, configure it to listen
on the correct VPN service IPs. This is more systemd-native and reliable.

Changes:
- Create socket override to listen on VPN IPs instead of localhost
- Clear default listeners and add VPN service IPs
- Use empty listen_addresses in dnscrypt-proxy.toml for socket activation
- Keep socket enabled and let systemd manage the activation
- Add handler for restarting socket when config changes

Benefits:
- Works WITH systemd instead of against it
- Survives package updates better
- No dependency conflicts
- More reliable service management

This approach is cleaner than disabling socket activation entirely and
ensures dnscrypt-proxy is accessible to VPN clients on the correct IPs.
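
For illustration only, a socket drop-in of this shape could be rendered as
sketched below. In a systemd drop-in, assigning an empty value to
`ListenStream=` or `ListenDatagram=` resets the inherited listener list
before new addresses are added. The function name `render_socket_override`
and the IPs are assumptions, and the actual override in the Algo role may be
laid out differently.

```python
def render_socket_override(listen_ips, port=53):
    """Render a [Socket] drop-in that replaces the default listeners."""
    # Empty assignments clear the listeners inherited from the packaged unit.
    lines = ["[Socket]", "ListenStream=", "ListenDatagram="]
    for ip in listen_ips:
        # systemd expects IPv6 addresses in brackets when a port follows.
        host = f"[{ip}]" if ":" in ip else ip
        lines.append(f"ListenStream={host}:{port}")
        lines.append(f"ListenDatagram={host}:{port}")
    return "\n".join(lines) + "\n"


# Placeholder service IPs; real values come from the deployment's configuration.
override_text = render_socket_override(["172.16.0.1", "fd00::1"])
print(override_text)
```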

* Document debugging lessons learned in CLAUDE.md

Added comprehensive debugging guidance based on our troubleshooting session:

- VPN connectivity troubleshooting order (DNS first!)
- systemd socket activation best practices
- Common deployment failures and solutions
- Time wasters to avoid (lessons learned the hard way)
- Multi-homed system considerations
- Testing notes for DigitalOcean

These additions will help future debugging sessions avoid the same
rabbit holes and focus on the most likely issues first.

* Fix DNS resolution for VPN clients by enabling route_localnet

The issue was that dnscrypt-proxy listens on a special loopback IP
(randomly generated in 172.16.0.0/12 range) which wasn't accessible
from VPN clients. This fix:

1. Enables net.ipv4.conf.all.route_localnet sysctl to allow routing
   to loopback IPs from other interfaces
2. Ensures dnscrypt-proxy socket is properly restarted when its
   configuration changes
3. Adds proper handler flushing after socket configuration updates

This allows VPN clients to reach the DNS resolver at the local_service_ip
address configured on the loopback interface.

* Improve security by using interface-specific route_localnet

Instead of enabling route_localnet globally (net.ipv4.conf.all.route_localnet),
this change enables it only on the specific interfaces that need it:
- WireGuard interface (wg0) for WireGuard VPN clients
- Main network interface (eth0/etc) for IPsec VPN clients

This minimizes the security impact by restricting loopback routing to only
the VPN interfaces, preventing other interfaces from being able to route
to loopback addresses.

The interface-specific approach provides the same functionality (allowing
VPN clients to reach the DNS resolver on the local_service_ip) while
reducing the potential attack surface.

* Revert to global route_localnet to fix deployment failure

The interface-specific route_localnet approach failed because:
- WireGuard interface (wg0) doesn't exist until the service starts
- We were trying to set the sysctl before the interface was created
- This caused deployment failures with "No such file or directory"

Reverting to the global setting (net.ipv4.conf.all.route_localnet=1) because:
- It always works regardless of interface creation timing
- VPN users are trusted (they have our credentials)
- Firewall rules still restrict access to only port 53
- The security benefit of interface-specific settings is minimal
- The added complexity isn't worth the marginal security improvement

This ensures reliable deployments while maintaining the DNS resolution fix.
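
The timing problem can be illustrated with a small sketch. On Linux the
per-interface sysctl lives under /proc/sys/net/ipv4/conf/<interface>/, and
that directory only exists once the interface does, whereas the `all` entry
is always present. The helper `route_localnet_enabled` and the temporary
directory standing in for /proc/sys are assumptions made for this example.

```python
import tempfile
from pathlib import Path


def route_localnet_enabled(interface="all", proc_root="/proc/sys"):
    """Return True/False for the sysctl, or None if the interface has no entry."""
    path = Path(proc_root) / "net" / "ipv4" / "conf" / interface / "route_localnet"
    try:
        return path.read_text().strip() == "1"
    except FileNotFoundError:
        # Mirrors the "No such file or directory" failure for a missing interface.
        return None


# Demonstration against a temporary directory standing in for /proc/sys.
with tempfile.TemporaryDirectory() as fake_root:
    conf_all = Path(fake_root) / "net" / "ipv4" / "conf" / "all"
    conf_all.mkdir(parents=True)
    (conf_all / "route_localnet").write_text("1\n")

    demo_all = route_localnet_enabled("all", proc_root=fake_root)
    demo_wg0 = route_localnet_enabled("wg0", proc_root=fake_root)

print(demo_all, demo_wg0)
```

In the demonstration the global entry reads as enabled, while the lookup for
the not-yet-created `wg0` entry returns `None`.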

* Fix dnscrypt-proxy socket restart and remove problematic BPF hardening

Two important fixes:

1. Fix dnscrypt-proxy socket not restarting with new configuration
   - The socket wasn't properly restarting when its override config changed
   - This caused DNS to listen on wrong IP (127.0.2.1 instead of local_service_ip)
   - Now directly restart the socket when configuration changes
   - Add explicit daemon reload before restarting

2. Remove BPF JIT hardening that causes deployment errors
   - The net.core.bpf_jit_enable sysctl isn't available on all kernels
   - It was causing "Invalid argument" errors during deployment
   - This was optional security hardening with minimal benefit
   - Removing it eliminates deployment errors for most users

These fixes ensure reliable DNS resolution for VPN clients and clean
deployments without error messages.

* Update CLAUDE.md with comprehensive debugging lessons learned

Based on our extensive debugging session, this update adds critical documentation:

## DNS Architecture and Troubleshooting
- Explained the local_service_ip design and why it requires route_localnet
- Added detailed DNS debugging methodology with exact steps in order
- Documented systemd socket activation complexities and common mistakes
- Added specific commands to verify DNS is working correctly

## Architectural Decisions
- Added new section explaining trade-offs in Algo's design choices
- Documented why local_service_ip uses loopback instead of alternatives
- Explained iptables-legacy vs iptables-nft backend choice

## Enhanced Debugging Guidance
- Expanded troubleshooting with exact commands and expected outputs
- Added warnings about configuration changes that need restarts
- Documented socket activation override requirements in detail
- Added common pitfalls like interface-specific sysctls

## Time Wasters Section
- Added new lessons learned from this debugging session
- Interface-specific route_localnet (fails before interface exists)
- DNAT for loopback addresses (doesn't work)
- BPF JIT hardening (causes errors on many kernels)

This documentation will help future maintainers avoid the same debugging
rabbit holes and understand why things are designed the way they are.

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-08-17 22:12:23 -04:00

#!/usr/bin/env python3
"""
Test that generated configuration files have valid syntax
This validates WireGuard, StrongSwan, SSH, and other configs
"""

import re
import subprocess
import sys


def check_command_available(cmd):
    """Check if a command is available on the system"""
    try:
        subprocess.run([cmd, "--version"], capture_output=True, check=False)
        return True
    except FileNotFoundError:
        return False


def test_wireguard_config_syntax():
    """Test WireGuard configuration file syntax"""
    # Sample WireGuard config based on Algo's template
    sample_config = """[Interface]
Address = 10.19.49.2/32,fd9d:bc11:4020::2/128
PrivateKey = SAMPLE_PRIVATE_KEY_BASE64==
DNS = 1.1.1.1,1.0.0.1

[Peer]
PublicKey = SAMPLE_PUBLIC_KEY_BASE64==
PresharedKey = SAMPLE_PRESHARED_KEY_BASE64==
AllowedIPs = 0.0.0.0/0,::/0
Endpoint = 10.0.0.1:51820
PersistentKeepalive = 25
"""

    # Validate config structure
    errors = []

    # Check for required sections
    if "[Interface]" not in sample_config:
        errors.append("Missing [Interface] section")
    if "[Peer]" not in sample_config:
        errors.append("Missing [Peer] section")

    # Validate Interface section
    interface_match = re.search(r"\[Interface\](.*?)\[Peer\]", sample_config, re.DOTALL)
    if interface_match:
        interface_section = interface_match.group(1)

        # Check required fields
        if not re.search(r"Address\s*=", interface_section):
            errors.append("Missing Address in Interface section")
        if not re.search(r"PrivateKey\s*=", interface_section):
            errors.append("Missing PrivateKey in Interface section")

        # Validate IP addresses
        address_match = re.search(r"Address\s*=\s*([^\n]+)", interface_section)
        if address_match:
            addresses = address_match.group(1).split(",")
            for addr in addresses:
                addr = addr.strip()
                # Basic IP validation
                if not re.match(r"^\d+\.\d+\.\d+\.\d+/\d+$", addr) and not re.match(r"^[0-9a-fA-F:]+/\d+$", addr):
                    errors.append(f"Invalid IP address format: {addr}")

    # Validate Peer section
    peer_match = re.search(r"\[Peer\](.*)", sample_config, re.DOTALL)
    if peer_match:
        peer_section = peer_match.group(1)

        # Check required fields
        if not re.search(r"PublicKey\s*=", peer_section):
            errors.append("Missing PublicKey in Peer section")
        if not re.search(r"AllowedIPs\s*=", peer_section):
            errors.append("Missing AllowedIPs in Peer section")
        if not re.search(r"Endpoint\s*=", peer_section):
            errors.append("Missing Endpoint in Peer section")

        # Validate endpoint format
        endpoint_match = re.search(r"Endpoint\s*=\s*([^\n]+)", peer_section)
        if endpoint_match:
            endpoint = endpoint_match.group(1).strip()
            if not re.match(r"^[\d\.\:]+:\d+$", endpoint):
                errors.append(f"Invalid Endpoint format: {endpoint}")

    if errors:
        print("✗ WireGuard config validation failed:")
        for error in errors:
            print(f" - {error}")
        assert False, "WireGuard config validation failed"
    else:
        print("✓ WireGuard config syntax validation passed")


def test_strongswan_ipsec_conf():
    """Test StrongSwan ipsec.conf syntax"""
    # Sample ipsec.conf based on Algo's template
    sample_config = """config setup
    charondebug="ike 2, knl 2, cfg 2, net 2, esp 2, dmn 2, mgr 2"
    strictcrlpolicy=yes
    uniqueids=never

conn %default
    keyexchange=ikev2
    dpdaction=clear
    dpddelay=35s
    dpdtimeout=150s
    compress=yes
    ikelifetime=24h
    lifetime=8h
    rekey=yes
    reauth=yes
    fragmentation=yes
    ike=aes128gcm16-prfsha512-ecp256,aes128-sha2_256-modp2048
    esp=aes128gcm16-ecp256,aes128-sha2_256-modp2048

conn ikev2-pubkey
    auto=add
    left=%any
    leftid=@10.0.0.1
    leftcert=server.crt
    leftsendcert=always
    leftsubnet=0.0.0.0/0,::/0
    right=%any
    rightid=%any
    rightauth=pubkey
    rightsourceip=10.19.49.0/24,fd9d:bc11:4020::/64
    rightdns=1.1.1.1,1.0.0.1
"""

    errors = []

    # Check for required sections
    if "config setup" not in sample_config:
        errors.append("Missing 'config setup' section")
    if "conn %default" not in sample_config:
        errors.append("Missing 'conn %default' section")

    # Validate connection settings
    conn_pattern = re.compile(r"conn\s+(\S+)")
    connections = conn_pattern.findall(sample_config)
    if len(connections) < 2:  # Should have at least %default and one other
        errors.append("Not enough connection definitions")

    # Check for required parameters in connections
    required_params = ["keyexchange", "left", "right"]
    for param in required_params:
        if f"{param}=" not in sample_config:
            errors.append(f"Missing required parameter: {param}")

    # Validate IP subnet formats
    subnet_pattern = re.compile(r"(left|right)subnet\s*=\s*([^\n]+)")
    for match in subnet_pattern.finditer(sample_config):
        subnets = match.group(2).split(",")
        for subnet in subnets:
            subnet = subnet.strip()
            if subnet != "0.0.0.0/0" and subnet != "::/0":
                if not re.match(r"^\d+\.\d+\.\d+\.\d+/\d+$", subnet) and not re.match(r"^[0-9a-fA-F:]+/\d+$", subnet):
                    errors.append(f"Invalid subnet format: {subnet}")

    if errors:
        print("✗ StrongSwan ipsec.conf validation failed:")
        for error in errors:
            print(f" - {error}")
        assert False, "ipsec.conf validation failed"
    else:
        print("✓ StrongSwan ipsec.conf syntax validation passed")


def test_ssh_config_syntax():
    """Test SSH tunnel configuration syntax"""
    # Sample SSH config for tunneling
    sample_config = """Host algo-tunnel
    HostName 10.0.0.1
    User algo
    Port 4160
    IdentityFile ~/.ssh/algo.pem
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
    ServerAliveInterval 60
    ServerAliveCountMax 3
    LocalForward 1080 127.0.0.1:1080
"""

    errors = []

    # Parse SSH config format
    lines = sample_config.strip().split("\n")
    current_host = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue

        if line.startswith("Host "):
            current_host = line.split()[1]
        elif current_host and " " in line:
            key, value = line.split(None, 1)

            # Validate common SSH options
            if key == "Port":
                try:
                    port = int(value)
                    if not 1 <= port <= 65535:
                        errors.append(f"Invalid port number: {port}")
                except ValueError:
                    errors.append(f"Port must be a number: {value}")
            elif key == "LocalForward":
                # Format: LocalForward [bind_address:]port host:hostport
                parts = value.split()
                if len(parts) != 2:
                    errors.append(f"Invalid LocalForward format: {value}")

    if not current_host:
        errors.append("No Host definition found")

    if errors:
        print("✗ SSH config validation failed:")
        for error in errors:
            print(f" - {error}")
        assert False, "SSH config validation failed"
    else:
        print("✓ SSH config syntax validation passed")


def test_iptables_rules_syntax():
    """Test iptables rules syntax"""
    # Sample iptables rules based on Algo's rules.v4.j2
    sample_rules = """*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A POSTROUTING -s 10.19.49.0/24 ! -d 10.19.49.0/24 -j MASQUERADE
COMMIT
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p icmp --icmp-type echo-request -j ACCEPT
-A INPUT -p tcp --dport 4160 -j ACCEPT
-A INPUT -p udp --dport 51820 -j ACCEPT
-A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -s 10.19.49.0/24 -j ACCEPT
COMMIT
"""

    errors = []

    # Check table definitions
    tables = re.findall(r"\*(\w+)", sample_rules)
    if "filter" not in tables:
        errors.append("Missing *filter table")
    if "nat" not in tables:
        errors.append("Missing *nat table")

    # Check for COMMIT statements
    commit_count = sample_rules.count("COMMIT")
    if commit_count != len(tables):
        errors.append(f"Number of COMMIT statements ({commit_count}) doesn't match tables ({len(tables)})")

    # Validate chain policies
    chain_pattern = re.compile(r"^:(\w+)\s+(ACCEPT|DROP|REJECT)\s+\[\d+:\d+\]", re.MULTILINE)
    chains = chain_pattern.findall(sample_rules)
    required_chains = [("INPUT", "DROP"), ("FORWARD", "DROP"), ("OUTPUT", "ACCEPT")]
    for chain, _policy in required_chains:
        if not any(c[0] == chain for c in chains):
            errors.append(f"Missing required chain: {chain}")

    # Validate rule syntax
    rule_pattern = re.compile(r"^-[AI]\s+(\w+)", re.MULTILINE)
    rules = rule_pattern.findall(sample_rules)
    if len(rules) < 5:
        errors.append("Insufficient firewall rules")

    # Check for essential security rules
    if "-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT" not in sample_rules:
        errors.append("Missing stateful connection tracking rule")

    if errors:
        print("✗ iptables rules validation failed:")
        for error in errors:
            print(f" - {error}")
        assert False, "iptables rules validation failed"
    else:
        print("✓ iptables rules syntax validation passed")


def test_dns_config_syntax():
    """Test dnsmasq configuration syntax"""
    # Sample dnsmasq config
    sample_config = """user=nobody
group=nogroup
interface=eth0
interface=wg0
bind-interfaces
bogus-priv
no-resolv
no-poll
server=1.1.1.1
server=1.0.0.1
local-ttl=300
cache-size=10000
log-queries
log-facility=/var/log/dnsmasq.log
conf-dir=/etc/dnsmasq.d/,*.conf
addn-hosts=/var/lib/algo/dns/adblock.hosts
"""

    errors = []

    # Parse config
    for line in sample_config.strip().split("\n"):
        line = line.strip()
        if not line or line.startswith("#"):
            continue

        # Most dnsmasq options are key=value or just key
        if "=" in line:
            key, value = line.split("=", 1)

            # Validate specific options
            if key == "interface":
                if not re.match(r"^[a-zA-Z0-9\-_]+$", value):
                    errors.append(f"Invalid interface name: {value}")
            elif key == "server":
                # Basic IP validation
                if not re.match(r"^\d+\.\d+\.\d+\.\d+$", value) and not re.match(r"^[0-9a-fA-F:]+$", value):
                    errors.append(f"Invalid DNS server IP: {value}")
            elif key == "cache-size":
                try:
                    size = int(value)
                    if size < 0:
                        errors.append(f"Invalid cache size: {size}")
                except ValueError:
                    errors.append(f"Cache size must be a number: {value}")

    # Check for required options
    required = ["interface", "server"]
    for req in required:
        if f"{req}=" not in sample_config:
            errors.append(f"Missing required option: {req}")

    if errors:
        print("✗ dnsmasq config validation failed:")
        for error in errors:
            print(f" - {error}")
        assert False, "dnsmasq config validation failed"
    else:
        print("✓ dnsmasq config syntax validation passed")


if __name__ == "__main__":
    tests = [
        test_wireguard_config_syntax,
        test_strongswan_ipsec_conf,
        test_ssh_config_syntax,
        test_iptables_rules_syntax,
        test_dns_config_syntax,
    ]

    failed = 0
    for test in tests:
        try:
            test()
        except AssertionError as e:
            print(f"{test.__name__} failed: {e}")
            failed += 1
        except Exception as e:
            print(f"{test.__name__} error: {e}")
            failed += 1

    if failed > 0:
        print(f"\n{failed} tests failed")
        sys.exit(1)
    else:
        print(f"\nAll {len(tests)} config syntax tests passed!")