// field notes · real-world troubleshooting

How Wi-Fi engineers actually troubleshoot - not what the textbook describes

The methodology is correct. The field is messier. Here's what 15 years of 802.11 troubleshooting actually looks like - scope first, controller second, Wireshark last.

— Shankar K. · field notes, not textbook excerpts

THE GAP

CWNA teaches: identify → discover scope → define causes → narrow down → action plan → verify → document.

The field teaches: figure out if it's actually a Wi-Fi problem first.

The real first question

Before touching a controller, before opening Wireshark, before anything - ask one question:

Does it happen on wired too?

If the user can plug in and reproduce the same problem, you just saved yourself an hour. DNS failure, firewall rule, VLAN misconfiguration, upstream outage - all of these look like "Wi-Fi problems" to users because Wi-Fi is the only connection they ever see. This is the fastest triage step in wireless troubleshooting and the one most engineers skip when they're eager to dive into the controller.

The scope question that cuts resolution time in half

Once you've confirmed it's a Wi-Fi issue, scope it before doing anything else. Scope tells you where to start - and more importantly, where not to waste time.

SYMPTOM | LIKELY CAUSE | WHERE TO START
One user, one device | Client-side | Device, driver, saved profile
One user, all their devices | Credential or account issue | RADIUS logs, user account
Multiple users, one area | AP or RF in that zone | That specific AP's status
Multiple users, everywhere | Infrastructure failure | Controller, DHCP, RADIUS, upstream

A single user with a single device failing is almost never an AP problem. Don't start with the AP.
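
The scope table is small enough to write down as a lookup. This is a minimal sketch, not a tool anyone ships: the function name and its three inputs are hypothetical, and real tickets rarely arrive this cleanly labelled.

```python
def where_to_start(users_affected: int, devices_per_user: int, single_area: bool) -> str:
    """Map ticket scope to a starting point, following the scope table above.

    All three parameters are illustrative inputs you would gather from the
    user or the help desk, not fields from any real ticketing system.
    """
    if users_affected <= 1:
        if devices_per_user <= 1:
            return "Client-side: device, driver, saved profile"
        return "Credential or account: RADIUS logs, user account"
    if single_area:
        return "AP or RF in that zone: that specific AP's status"
    return "Infrastructure: controller, DHCP, RADIUS, upstream"


# One user whose laptop and phone both fail points away from the AP
print(where_to_start(users_affected=1, devices_per_user=2, single_area=True))
```

The point of writing it this way is the order of the checks: user count first, then device count, then location. The AP only comes up once more than one user is involved.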

The 80% that never needs Wireshark

Most wireless tickets - roughly 80% in a well-run enterprise environment - resolve before a PCAP is ever opened.

WRONG PASSWORD / STALE PROFILE

The single most common cause of "I can't connect." If the network password changed recently and the device is still offering the old one, the 4-way handshake fails at message 2 (MIC mismatch) and every attempt ends in a handshake timeout. The fix is a 30-second conversation, not a packet capture.

DRIVER OUT OF DATE

Responsible for a surprising percentage of intermittent disconnections, especially after OS updates. Windows Update frequently touches wireless drivers without the user noticing. Check Device Manager first.

CORRUPTED SAVED PROFILE

The device remembers the network but has a corrupted or mismatched profile from a previous security configuration. Forget the network, reconnect fresh. Resolves in under two minutes.

AP DOWN OR REBOOTING

LEDs don't lie. An AP cycling through boot states explains why an entire floor can't connect. Check the controller - if the AP isn't registering, check PoE, then uplink, then hardware.

DHCP POOL EXHAUSTED

Classic symptom: the client associates successfully and shows "connected," but has no internet. Check the DHCP server - if the pool is full, no new clients get addresses. Shorten the lease time or expand the pool.
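
A quick tell on the client side: a device that never got a lease usually self-assigns a link-local address in 169.254.0.0/16. The sketch below checks for that with Python's standard ipaddress module. The function name is made up for illustration, and a link-local address only says DHCP failed, not why. An unreachable DHCP server looks the same as an exhausted pool.

```python
import ipaddress

# 169.254.0.0/16 is the IPv4 link-local range clients fall back to without a lease
LINK_LOCAL = ipaddress.ip_network("169.254.0.0/16")


def looks_like_dhcp_failure(client_ip: str) -> bool:
    """True when the client's IPv4 address is self-assigned (no DHCP lease)."""
    return ipaddress.ip_address(client_ip) in LINK_LOCAL


print(looks_like_dhcp_failure("169.254.12.7"))
print(looks_like_dhcp_failure("10.20.30.40"))
```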

USER IN A DEAD ZONE

They moved to a different part of the building. The signal is marginal. Their device is sticky and won't roam. A quick RSSI check on the controller tells you this in under a minute.

What the controller dashboard actually tells you

The wireless controller is your first stop on almost every real ticket - not Wireshark. Here's what to look for and what it means.

CLIENT EVENT LOG (BY MAC ADDRESS)

Look up the client's MAC and read the connection event history. You'll see association attempts, authentication outcomes, deauths with reason codes, and roaming events. This answers 60% of remaining questions after basic triage - and most engineers don't use it enough.

AP CLIENT COUNT + CHANNEL UTILIZATION

High client count + high channel utilization = capacity problem, not coverage problem. These two are often confused. Coverage problems need more APs. Capacity problems need load balancing or fewer SSIDs.

RETRY RATES PER AP

Retry rates above 15–20% signal something wrong - heavy co-channel interference (CCI), marginal RSSI for some clients, or a single problematic client dragging the AP down. The controller shows you this without a capture.
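
If your platform exposes raw frame counters rather than a percentage, the arithmetic is simple. A rough sketch, assuming you can read a retried-frame count and a total-frame count for the same interval; the function names and the "watch" / "investigate" labels are my own, and only the 15% and 20% thresholds come from the rule of thumb above.

```python
def retry_rate_percent(retried_frames: int, total_frames: int) -> float:
    """Retries as a percentage of all transmitted frames in the interval."""
    if total_frames <= 0:
        return 0.0  # no traffic, nothing to judge
    return 100.0 * retried_frames / total_frames


def retry_verdict(rate: float, warn_at: float = 15.0, bad_at: float = 20.0) -> str:
    """Bucket a retry rate using the 15-20% rule of thumb."""
    if rate >= bad_at:
        return "investigate"
    if rate >= warn_at:
        return "watch"
    return "ok"


rate = retry_rate_percent(retried_frames=180, total_frames=1000)
print(rate, retry_verdict(rate))
```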

RADIUS AUTHENTICATION LOGS

For 802.1X networks, always check RADIUS logs before assuming the problem is on the wireless side. An Access-Reject from the RADIUS server looks identical to a Wi-Fi association failure from the user's perspective.

What "Wi-Fi is slow" actually means

"Slow" is the most common complaint and the least specific. Translate it before doing anything.

SLOW TO CONNECT

Authentication or DHCP is taking too long - not a throughput problem. Look at EAPOL timing, RADIUS response time, DHCP server latency.

SLOW ONCE CONNECTED

Actual throughput is degraded. Look at retry rates, MCS index, channel utilization, and whether the client is in a coverage hole.

SLOW FOR CERTAIN APPS

Probably not Wi-Fi at all. QoS policy, firewall rule, DNS resolution time, or application server issue. Eliminate the network with ping/iperf first.

SLOW AT CERTAIN TIMES

Almost always a capacity or channel utilization issue. Pull historical utilization data from the controller for the reported time window.

What "keeps disconnecting" actually means

"Disconnecting" has four completely different root causes depending on the pattern.

EVERY FEW SECONDS

Likely a roaming issue. Sticky client drops below usable RSSI, deauths, reconnects. Check RSSI at time of deauth in the controller event log.

AFTER EXACTLY N MINUTES

An idle timeout or session timeout is kicking the client. Check idle-timeout settings - especially for IoT and mobile devices.

RANDOMLY, NO PATTERN

Could be DFS radar causing a channel change, AP firmware bug, or a driver issue. Start with the AP event log - look for unexpected restarts or channel changes.

WHEN MOVING

Roaming failure. The client is transitioning between APs but the handoff isn't clean - PMKID caching failure, 802.11r Fast Transition (FT) misconfiguration, or a sticky client waiting too long to roam.
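
Telling "after exactly N minutes" apart from "randomly" is easier with numbers than with a user's memory. Here is a minimal sketch, assuming you have copied the disconnect timestamps out of the controller event log as ISO 8601 strings. The function names and the half-minute tolerance are hypothetical choices, not anything a controller provides.

```python
from datetime import datetime
from statistics import pstdev


def disconnect_intervals(timestamps: list[str]) -> list[float]:
    """Minutes between consecutive disconnects."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


def looks_like_timeout(timestamps: list[str], tolerance_min: float = 0.5) -> bool:
    """True when the gaps are nearly identical, which hints at a timer."""
    gaps = disconnect_intervals(timestamps)
    if len(gaps) < 2:
        return False  # need at least three events to call it a pattern
    return pstdev(gaps) <= tolerance_min


regular = ["2024-05-01T09:00:00", "2024-05-01T09:30:05",
           "2024-05-01T10:00:02", "2024-05-01T10:30:04"]
print(disconnect_intervals(regular))
print(looks_like_timeout(regular))
```

Evenly spaced gaps point at an idle or session timer. Scattered gaps send you back to the AP event log.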

When to actually open Wireshark

The remaining 20% of tickets - the ones that don't resolve from basic triage and the controller dashboard - are where protocol analysis earns its value.

Open Wireshark when:

For what to look for once Wireshark is open → 802.11 Connection Failure PCAP Guide

The tools engineers actually use daily

CONTROLLER DASHBOARD

Aruba Central, Cisco Catalyst Center, Juniper Mist, Meraki, Ruckus One - whichever platform you're on, this is where you spend the majority of troubleshooting time. Know where the client event logs live. Know how to filter by MAC address. Know how to pull AP event history.

PING AND IPERF

Unglamorous. Indispensable. A ping to the default gateway confirms L3 reachability in five seconds. iperf between the client and a wired server separates a "slow Wi-Fi" complaint from a "slow application" complaint in under a minute.
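
If you want the throughput number in a ticket rather than on a screenshot, iperf3 can emit JSON with its -J flag. The sketch below assumes iperf3 specifically (not iperf2) and a TCP test, where the summary lives under end.sum_received. The function name and the trimmed sample are illustrative only.

```python
import json


def received_mbps(iperf3_json: str) -> float:
    """Receiver-side throughput in Mbit/s from `iperf3 -c <server> -J` output."""
    result = json.loads(iperf3_json)
    bits_per_second = result["end"]["sum_received"]["bits_per_second"]
    return bits_per_second / 1_000_000


# Heavily trimmed stand-in for real iperf3 output, for illustration only
sample = json.dumps({"end": {"sum_received": {"bits_per_second": 94500000.0}}})
print(received_mbps(sample))
```

Run it once from the wireless client and once from a wired machine on the same VLAN. If both numbers are fine, the complaint is about the application, not the Wi-Fi.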

WI-FI ANALYZER APP

A quick channel scan in the problem area tells you: what APs are visible, what RSSI the client sees, what channels are in use, whether there's CCI. Faster than pulling a survey tool for quick triage.

NETSH / AIRPORT UTILITY

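netsh wlan show interfaces (Windows) and airport -I (macOS) show the current SSID, BSSID, signal strength, channel, and auth type. On recent macOS releases the airport tool is deprecated and Apple points to wdutil and Wireless Diagnostics instead. Either way, the output often reveals a sticky client problem immediately.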
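
When a user pastes netsh output into a ticket, a few lines of Python turn it into something you can compare across tickets. This is a sketch under assumptions: the sample text is a shortened, made-up stand-in for real output, field names are localized on non-English Windows, and the function name is hypothetical.

```python
def parse_netsh_interfaces(output: str) -> dict[str, str]:
    """Turn 'Key : Value' lines from `netsh wlan show interfaces` into a dict."""
    fields = {}
    for line in output.splitlines():
        # Split on the first " : " only, so the colons inside a BSSID survive
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


SAMPLE = """
    Name                   : Wi-Fi
    State                  : connected
    SSID                   : CorpWiFi
    BSSID                  : aa:bb:cc:dd:ee:ff
    Authentication         : WPA2-Enterprise
    Channel                : 36
    Signal                 : 42%
"""

info = parse_netsh_interfaces(SAMPLE)
print(info["BSSID"], info["Channel"], info["Signal"])
```

A low Signal value on a BSSID that belongs to an AP two rooms away is the sticky-client signature. Windows reports Signal as a percentage here, not dBm.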

RADIUS / NPS LOGS

Most engineers forget to check here. These live on the authentication server, not the wireless controller. Most 802.1X failures - certificate expiry, policy mismatch, wrong credentials - are visible here and invisible in the controller.

The unwritten rules

Tickets lie. Timestamps don't.

"It's been broken all day" often means "I noticed it two hours ago." The actual failure event might be a 3am AP reboot, a RADIUS cert that expired at midnight, or a DFS channel change at 6am. Always pull the controller event log for the 24 hours before the ticket.

One user ≠ Wi-Fi problem. Ten users ≠ AP problem.

Scale matters but it's not binary. Ten users affected in one room might be one failing AP. Ten users affected across three floors might be one RADIUS server returning errors. Let the scope table do the thinking, not the user count.

The controller says "connected." The PCAP says "Status Code 53."

These disagree more often than they should. Status code 53 means "invalid PMKID" - the cached key the client offered was rejected, so it had to fall back to a full authentication. Controller dashboards count the eventual successful re-auth as a successful roam. The PCAP shows the 1.5 seconds of dropped voice call in between. When a user says "the call keeps cutting out for a second" - believe the PCAP over the dashboard.

Intermittent doesn't mean random.

It means you haven't found the pattern yet. Ask: same time of day? Same location? Same application? Same device type? A problem that "just happens randomly" almost always ties to a roaming event, a DHCP lease renewal, or a RADIUS session timeout.
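
Those four questions turn into a simple tally once you have a handful of incidents written down. A minimal sketch, assuming you record each incident as a small dict. The field names below are hypothetical and should match whatever you actually collect.

```python
from collections import Counter


def common_factors(incidents: list[dict],
                   fields=("hour", "location", "app", "device_type")) -> dict:
    """For each field, return the most frequent value and its share of incidents."""
    summary = {}
    for field in fields:
        counts = Counter(i[field] for i in incidents if field in i)
        if counts:
            value, hits = counts.most_common(1)[0]
            summary[field] = (value, hits / len(incidents))
    return summary


incidents = [
    {"hour": 9, "location": "3F-east", "device_type": "laptop"},
    {"hour": 14, "location": "3F-east", "device_type": "phone"},
    {"hour": 16, "location": "3F-east", "device_type": "laptop"},
]
print(common_factors(incidents))
```

A field where one value covers nearly every incident is your pattern. In the sample data that is the location, not the time of day.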

Reboot fixes symptoms, not causes.

Rebooting an AP clears transient state and often resolves the immediate complaint - but if the root cause is still there, the problem comes back. Document what you rebooted and why. Check whether it recurs within 24 hours.

When to escalate and to whom

CONDITION | ESCALATE TO
AP won't register after reboot | Network ops / hardware swap
RADIUS Access-Reject with no obvious cause | Identity / security team
DHCP pool exhausted repeatedly | Server / infrastructure team
Firmware bug confirmed | Vendor TAC
Intermittent interference, no identifiable source | RF engineering / spectrum analysis
Problem persists across AP replacement | Controller config audit

The certification vs. reality gap

CWNA gives you the vocabulary, the framework, and the protocol knowledge. That's real value - it makes you dangerous in the right way when you get to the deep end. But the cert teaches you the what of every failure mode. The field teaches you which ones you'll actually see, in what order to check them, and which three questions to ask the user before touching anything.

The gap closes with every ticket. The engineers who close it fastest are the ones who stay curious about what the frames are actually saying - even when the ticket resolves at the controller level. That habit is what separates someone who can troubleshoot from someone who can troubleshoot and explain exactly why the fix worked.

Go deeper

When basic triage and the controller don't give you the answer - the PCAP will.

Connection Failure PCAP Guide →
Status & Reason Codes →

— Shankar K., Wi-Fi engineer, Irving TX
Building WiFi Analyser V2 · CWNA-109 in progress · one post every two weeks
