
edge-tier=2 WDT storm on ESP32-S3 N16R8 — DSP task starves UDP sender (v0.6.5-esp32) #683

@pjmatlock

Summary

On a clean v0.6.5-esp32 flash to an ESP32-S3 N16R8 (16 MB flash / 8 MB PSRAM), provisioning with --edge-tier 2 produces a sustained task_wdt storm on edge_dsp (CPU 1) within ~30 seconds of boot. The DSP task monopolizes core 1 and starves the UDP sender: a host UDP listener on the configured --target-ip:5005 measured 0 packets/s. The README and release notes for v0.6.5 state "boots cleanly at --edge-tier 2 with full vitals + edge DSP active," so this looks like an undocumented regression on the N16R8 (16 MB flash / 8 MB PSRAM) variant.

Switching to --edge-tier 1 is stable (~1.5 pps, vitals + presence work) but does not send raw CSI amplitudes, so the server-side pose model emits keypoints with confidence: 0.0 and the observatory pose view is empty.

Hardware

  • Board: ESP32-S3 dev board labelled "Gold Edition N16R8" (Waveshare-style, AMOLED SH8601 1.8" 368×448 display detected on boot; no FT3168 touch, no TCA9554)
  • Chip: ESP32-S3 (QFN56) rev v0.2, 8 MB embedded PSRAM, 16 MB flash
  • MAC: e8:f6:0a:a4:e1:ac
  • USB-Serial/JTAG, COM5 on Windows
  • WiFi: 2.4 GHz WPA2-PSK, RSSI -25 dBm to AP (Pakedge AN-810-AP-I-AC), channel 1/6

Firmware

  • Release: v0.6.5-esp32 (binaries flashed verbatim from firmware/esp32-csi-node/release_bins/bootloader.bin, partition-table.bin, ota_data_initial.bin, esp32-csi-node.bin)
  • Standard 16 MB partition variant (not -4mb)

Reproduction

# 1. Erase + flash (clean state)
python -m esptool --chip esp32s3 --port COM5 erase-flash
python -m esptool --chip esp32s3 --port COM5 --baud 460800 write-flash \
  0x0 bootloader.bin 0x8000 partition-table.bin \
  0xf000 ota_data_initial.bin 0x20000 esp32-csi-node.bin

# 2. Provision with --edge-tier 2
python provision.py --port COM5 --ssid "<my-2.4ghz-ssid>" --password "<pw>" \
  --target-ip <my-server-ip> --edge-tier 2

# 3. Observe serial — WDT storm begins within ~30s
# 4. Listen for UDP packets on <my-server-ip>:5005 → ~0 pps

Observed Serial Output

Clean boot succeeds and the chip enters streaming state, then within ~30 seconds the watchdog begins firing repeatedly:

I (7706) main: CSI streaming active → 192.168.5.11:5005 (edge_tier=2, OTA=ready, WASM=ready, mmWave=off)
E (98661) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time:
E (98661) task_wdt:  - IDLE1 (CPU 1)
E (98661) task_wdt: Tasks currently running:
E (98661) task_wdt: CPU 0: IDLE0
E (98661) task_wdt: CPU 1: edge_dsp
E (98661) task_wdt: Print CPU 1 backtrace
Backtrace: 0x4037890F:0x3FC9D6A0 0x4037746D:0x3FC9D6C0 0x4200D225:0x3FCC9C60

The same three-frame backtrace recurs every ~5 s and is reproducible across multiple reset cycles.
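The log above can be reduced to trigger timestamps and backtrace program counters with a few lines of Python. This is a minimal sketch, not project tooling: the regular expressions assume the log layout shown in the excerpt, and the ELF file name passed to addr2line is a placeholder, since resolving addresses needs the ELF that matches the flashed binary.

```python
import re

# Patterns assume the ESP-IDF log layout shown in the excerpt above.
WDT_TRIGGER = re.compile(r"^E \((\d+)\) task_wdt: Task watchdog got triggered")
BACKTRACE_FRAME = re.compile(r"(0x[0-9A-Fa-f]{8}):0x[0-9A-Fa-f]{8}")


def wdt_trigger_times_ms(log_text):
    """Return the millisecond timestamp of every task_wdt trigger line."""
    times = []
    for line in log_text.splitlines():
        match = WDT_TRIGGER.match(line.strip())
        if match:
            times.append(int(match.group(1)))
    return times


def backtrace_pcs(line):
    """Extract the program-counter half of each PC:SP pair in a Backtrace line."""
    if not line.startswith("Backtrace:"):
        return []
    return BACKTRACE_FRAME.findall(line)


def addr2line_command(pcs, elf_path="esp32-csi-node.elf"):
    """Build (but do not run) an addr2line invocation. elf_path is a placeholder."""
    return ["xtensa-esp32s3-elf-addr2line", "-pfiaC", "-e", elf_path, *pcs]


sample = """\
E (98661) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time:
E (98661) task_wdt:  - IDLE1 (CPU 1)
Backtrace: 0x4037890F:0x3FC9D6A0 0x4037746D:0x3FC9D6C0 0x4200D225:0x3FCC9C60
"""
times = wdt_trigger_times_ms(sample)
pcs = backtrace_pcs(sample.splitlines()[-1])
print(times, pcs)
print(" ".join(addr2line_command(pcs)))
```

Run over a longer capture, the differences between successive entries in `times` give the interval between watchdog triggers, and the printed command can be run from an ESP-IDF shell once a matching ELF is available.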

Workarounds Attempted (All Tier-2 Variants Still WDT)

| Provision args | Result |
| --- | --- |
| --edge-tier 2 (defaults: subk 32, vital_win 300, vital_int 1000) | WDT storm, 0 pps |
| --edge-tier 2 --subk-count 8 | WDT storm, 0 pps |
| --edge-tier 2 --subk-count 8 --vital-win 100 | WDT storm, 0 pps |
| --edge-tier 2 --subk-count 8 --vital-win 100 --vital-int 5000 | WDT storm, 0 pps |
| --edge-tier 1 | OK, ~1.5 pps stats; vitals + presence work server-side |
| --edge-tier 0 | Mislabeled in --help as "raw passthrough" but actually sends 0 packets; stream_sender_send is not invoked when tier=0 in edge_processing.c:1049-1052 |
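For anyone re-running the same matrix, the provisioning commands can be generated rather than typed by hand. The sketch below only builds and prints the argument lists. It uses the flags that appear in this report, keeps the placeholders from the reproduction steps, and assumes nothing about whether a re-flash is needed between runs.

```python
import shlex

# Common arguments, with the same placeholders as the reproduction steps.
BASE = [
    "python", "provision.py", "--port", "COM5",
    "--ssid", "<my-2.4ghz-ssid>", "--password", "<pw>",
    "--target-ip", "<my-server-ip>",
]

# Extra arguments for each row of the table, in the same order.
VARIANTS = [
    ["--edge-tier", "2"],
    ["--edge-tier", "2", "--subk-count", "8"],
    ["--edge-tier", "2", "--subk-count", "8", "--vital-win", "100"],
    ["--edge-tier", "2", "--subk-count", "8", "--vital-win", "100",
     "--vital-int", "5000"],
    ["--edge-tier", "1"],
    ["--edge-tier", "0"],
]


def build_commands(base=BASE, variants=VARIANTS):
    """Return one full argv list per table row, without executing anything."""
    return [base + extra for extra in variants]


for argv in build_commands():
    # shlex.join quotes for POSIX shells; adjust quoting by hand on Windows.
    print(shlex.join(argv))
```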

Source Reading

firmware/esp32-csi-node/main/edge_processing.c edge_task() (lines 904-939) looks correct: it calls vTaskDelay(1) between frames in a batch and yields for 20 ms after each batch, with comments explicitly referencing prior watchdog fixes (#266, #321). So the hang appears to be deeper, inside process_frame() (lines 710-898) or one of its callees, on this board variant. Candidates I suspect but have not confirmed with instrumentation:

  • update_multi_person_vitals() (lines 476-550) with EDGE_PHASE_HISTORY_LEN-sized inner loops over up to EDGE_MAX_PERSONS groups
  • estimate_bpm_zero_crossing() on full 300-sample histories
  • WASM dispatch path at process_frame() lines 879-897 — wasm_runtime_on_frame() is called every frame when tier >= 2 and s_pkt_valid — could this block?

I haven't built from source to instrument this; flagging in case the maintainer recognizes it immediately.

Server-Side Evidence

ruvnet/wifi-densepose:latest Docker container, CSI_SOURCE=esp32. Listening on UDP 5005:

TOTAL: 0 packets, 0 bytes in 10.0s = 0.0 pps   # --edge-tier 2
TOTAL: 18 packets, 2212 bytes in 12.1s = 1.5 pps  # --edge-tier 1
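A packet count in this format can be reproduced with a small standard-library listener. This is a sketch of one way to take the measurement, not necessarily the tool used for the numbers above. The demo at the bottom sends three datagrams to itself over loopback so it runs without hardware.

```python
import socket
import time


def count_udp_packets(sock, duration_s):
    """Count datagrams and bytes received on a bound UDP socket for duration_s seconds."""
    packets = 0
    total_bytes = 0
    start = time.monotonic()
    deadline = start + duration_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        sock.settimeout(remaining)  # never wait past the deadline
        try:
            data, _addr = sock.recvfrom(65535)
        except socket.timeout:
            break
        packets += 1
        total_bytes += len(data)
    elapsed = time.monotonic() - start
    return packets, total_bytes, elapsed


def format_total(packets, total_bytes, elapsed):
    """Format a summary line in the same shape as the output above."""
    pps = packets / elapsed if elapsed > 0 else 0.0
    return f"TOTAL: {packets} packets, {total_bytes} bytes in {elapsed:.1f}s = {pps:.1f} pps"


# Loopback self-check: an OS-assigned port, three 10-byte datagrams.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for _ in range(3):
    tx.sendto(b"x" * 10, rx.getsockname())
packets, total_bytes, elapsed = count_udp_packets(rx, 0.3)
print(format_total(packets, total_bytes, elapsed))
tx.close()
rx.close()
```

For the real measurement, bind to `("0.0.0.0", 5005)` on the server host and pass a duration of about 10 seconds. If the Docker container already holds port 5005, stop it first or capture on a different target port.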

With --edge-tier 1, /api/v1/nodes reports the node as motion_level: present_moving, person_count: 1, but status: stale (1.5 pps is below the freshness threshold). /api/v1/models/load succeeds for the bundled wifi-densepose-v1.rvf (13 KB, found at docker/wifi-densepose-v1.rvf), but pose keypoints in the WebSocket sensing_update stream all report confidence: 0.0 because nodes[].amplitude = [] and subcarrier_count = 0 in tier=1 payloads.
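The missing-amplitude condition is easy to check mechanically on a decoded sensing_update message. The sketch below assumes the message is a JSON object with a `nodes` list whose entries carry `amplitude` and `subcarrier_count`, as described above. Everything else about the message shape, and the sample amplitude values, are illustrative assumptions.

```python
def nodes_without_csi(update):
    """Return the indices of nodes whose payload carries no raw CSI amplitudes."""
    missing = []
    for index, node in enumerate(update.get("nodes", [])):
        no_amplitude = not node.get("amplitude")
        no_subcarriers = node.get("subcarrier_count", 0) == 0
        if no_amplitude or no_subcarriers:
            missing.append(index)
    return missing


# Shape of a tier=1 payload as observed: empty amplitude, zero subcarriers.
tier1_like = {"nodes": [{"amplitude": [], "subcarrier_count": 0}]}
# Hypothetical payload with raw CSI present (values are made up).
with_csi = {"nodes": [{"amplitude": [0.5, 0.7], "subcarrier_count": 2}]}

print(nodes_without_csi(tier1_like))
print(nodes_without_csi(with_csi))
```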

What Would Help

  1. Confirm whether v0.6.5-esp32 tier=2 was validated on the N16R8 (16 MB flash / 8 MB PSRAM) variant or only on a different board (e.g. 4 MB). The release_bins/ directory ships both a regular and a -4mb set, so the 16 MB binary may have a config divergence.
  2. If a fix lands, a release_bins rebuild with that diff would let people on N16R8 boards (a very common cheap board on Amazon/AliExpress) use the project as documented.

Happy to provide additional logs, run instrumented builds, or test pre-release binaries against this exact board if helpful.

Labels: bug, firmware
