Skip to content

Data loss: push worker unconditionally clears dirty_since after wait_until_done, allowing pull reconciler to overwrite concurrent writes #18

Description

@miyannishar

Summary

The push worker unconditionally clears dirty_since = None after wait_until_done(), creating a up-to-30-second window where concurrent user writes can be silently lost. If a user writes to a file while the push worker is waiting for server processing, the pull reconciler can overwrite their latest edits with stale server content.

Root Cause

In push.rs, after a successful POST/PATCH + wait_until_done():

// push.rs:194-197
wait_until_done(api, remote_id, WAIT_DONE_MAX).await;  // blocks up to 30s
db.set_mirrored_state(ino, None, Some("done"), Some(now_ms()));
db.set_dirty_since(ino, None);   // UNCONDITIONAL CLEAR

set_dirty_since(ino, None) clears the dirty flag without checking whether a new write has occurred since the job was claimed. This affects all 3 push paths (PATCH, POST, and binary upload).

Reproduction Timeline

T=0s    User writes "version A" -> dirty_since = T0, push enqueued
T=0.5s  Push worker claims job, POSTs content to server
T=1s    wait_until_done() starts polling (up to 30s)

T=2s    User writes "version B" -> dirty_since = T2
        (push_queue pending slot correctly filled)

T=4s    wait_until_done() returns -> set_dirty_since(ino, None)
        Destroys T2's dirty flag!

T=5s    Pull reconciler runs, fetches server doc (still "version A")
        dirty_since = None -> no protection -> OVERWRITES local file
        "version B" is SILENTLY LOST

Compounding Issue: wait_until_done return value ignored

The return value of wait_until_done() (indicating timeout vs success) is completely ignored:

wait_until_done(api, remote_id, WAIT_DONE_MAX).await;  // return value discarded
db.set_mirrored_state(ino, None, Some("done"), Some(now_ms()));  // stamps "done" anyway

If the server times out, the code still stamps last_status = "done", causing the inflight poller to stop tracking the document. If the server subsequently fails, the error is never detected.

Why It's Hard to Catch

  1. The pending slot mechanism works correctly, the push queue properly coalesces writes
  2. dirty_since is refreshed by the user's write, but then destroyed 2s later by the push worker
  3. Only manifests when: server processing > 500ms AND user saves during that window AND a pull loop runs before the second push completes
  4. With autosave editors and the 30s WAIT_DONE_MAX + 5s pull interval, this is very plausible in production

Proposed Fix

Compare-and-swap: Capture dirty_since at job claim time, only clear if unchanged.

// In push_queue_claim_next: snapshot the current value
let dirty_since_at_claim = db.get_dirty_since(ino);

// After wait_until_done: conditional clear instead of unconditional
db.clear_dirty_since_if_unchanged(ino, dirty_since_at_claim);
// SQL: "UPDATE fs_inode SET dirty_since = NULL WHERE ino = ?1 AND dirty_since = ?2"

This ensures that if the user writes during the wait window, the new dirty_since is preserved and the pull reconciler correctly returns SkippedDirty.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions