Skip to content

fix(sse): fix memory leaks in SSE stream cleanup and add memory telemetry#3378

Merged
waleedlatif1 merged 7 commits intostagingfrom
waleedlatif1/add-aws-monitoring
Feb 28, 2026
Merged

fix(sse): fix memory leaks in SSE stream cleanup and add memory telemetry#3378
waleedlatif1 merged 7 commits intostagingfrom
waleedlatif1/add-aws-monitoring

Conversation

@waleedlatif1
Copy link
Collaborator

Summary

  • Fix MCP events SSE missing cancel() handler — leaked intervals and pub/sub subscribers on disconnect
  • Fix workflow execute SSE cancel() not aborting execution — workflows continued running after client disconnect
  • Fix A2A serve abort listener missing { once: true } — listener closures held until GC
  • Cap event-buffer and stream-buffer pending arrays to prevent unbounded growth on Redis failure
  • Cap isolated-vm stderr accumulation at 64KB
  • Add periodic memory telemetry (heap, RSS, uptime) logged every 60s
  • Add active SSE connection counter by route for leak diagnostics

Type of Change

  • Bug fix

Testing

  • TypeScript compiles clean (bunx tsc --noEmit)
  • All 4,162 tests pass (bunx vitest run)
  • Lint clean (bun run lint)

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel
Copy link

vercel bot commented Feb 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Feb 28, 2026 6:13pm

Request Review

@waleedlatif1 waleedlatif1 force-pushed the waleedlatif1/add-aws-monitoring branch from e7dec5f to cf2a497 Compare February 28, 2026 01:11
@waleedlatif1
Copy link
Collaborator Author

@cursor review

@waleedlatif1
Copy link
Collaborator Author

@greptile

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 28, 2026

Greptile Summary

This PR systematically addresses SSE memory leaks and adds observability for tracking them. The changes add missing cancel() handlers to all SSE endpoints, implement idempotent cleanup functions with guard flags to prevent double-decrement, and add { once: true } to abort event listeners to prevent listener closure leaks.

Key improvements:

  • Fixed MCP events SSE missing cancel() handler that leaked intervals and pub/sub subscribers
  • Added SSE connection tracking across all endpoints (a2a, mcp, wand, workflow execute/stream) for diagnostics
  • Capped pending arrays in event-buffer and stream-buffer to prevent unbounded growth during Redis failures
  • Capped isolated-vm stderr accumulation at 64KB
  • Added periodic memory telemetry (heap, RSS, active resources, SSE connections) logged every 60s
  • Used stable process.getActiveResourcesInfo() API instead of private Node.js APIs

Note on workflow execution behavior:
The PR preserves "run-on-leave" behavior for workflow executions - workflows continue running after client disconnect. This was an intentional design decision (see commit 1f0fa8a7), though the PR description may be unclear about this.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - well-structured memory leak fixes with proper guard flags
  • All changes follow consistent patterns for SSE cleanup with idempotent guards, addresses real memory leak issues, uses stable Node.js APIs, and includes comprehensive telemetry for ongoing monitoring. The code is defensive and won't cause double-decrements or other cleanup issues.
  • No files require special attention - all implementations are clean and consistent

Important Files Changed

Filename Overview
apps/sim/lib/monitoring/memory-telemetry.ts New file implementing periodic memory telemetry logging with proper use of stable Node.js API
apps/sim/app/api/mcp/events/route.ts Added missing cancel() handler, cleanup function with SSE tracking, and proper abort listener cleanup
apps/sim/app/api/workflows/[id]/execute/route.ts Added SSE connection tracking with guard flags to prevent double-decrement
apps/sim/lib/execution/event-buffer.ts Added capping logic to prevent unbounded pending array growth during Redis failures
apps/sim/lib/copilot/orchestrator/stream-buffer.ts Added capping logic for pending events using config.eventLimit to prevent memory growth

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[SSE Stream Created] --> B[start called]
    B --> C[incrementSSEConnections]
    C --> D{Stream Active}
    
    D -->|Client Disconnects| E[abort event fires]
    D -->|Normal Flow| F[Stream Processing]
    
    E --> G{once: true}
    G --> H[cleanup called]
    
    D -->|Stream Cancelled| I[cancel handler]
    I --> H
    
    F -->|Complete/Error| J[finally block]
    J --> H
    
    H --> K{Guard Flag Check}
    K -->|Not Cleaned| L[Clear intervals/timeouts]
    L --> M[Unsubscribe listeners]
    M --> N[decrementSSEConnections]
    N --> O[Set cleaned=true]
    
    K -->|Already Cleaned| P[Skip - idempotent]
    
    O --> Q[Stream Closed]
    P --> Q
    
    style H fill:#90EE90
    style K fill:#FFE4B5
    style N fill:#87CEEB
Loading

Last reviewed commit: a3a7e17

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 files reviewed, 4 comments

Edit Code Review Agent Settings | Greptile

@waleedlatif1 waleedlatif1 force-pushed the waleedlatif1/add-aws-monitoring branch from cf2a497 to 2b37d5b Compare February 28, 2026 01:16
@waleedlatif1 waleedlatif1 force-pushed the waleedlatif1/add-aws-monitoring branch from 2b37d5b to 979cae7 Compare February 28, 2026 01:17
@waleedlatif1
Copy link
Collaborator Author

@cursor review

@waleedlatif1
Copy link
Collaborator Author

@greptile

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 28, 2026

Additional Comments (1)

apps/sim/app/api/workflows/[id]/execute/route.ts
cancel() closes eventWriter while start() execution is still in-flight

cancel() calls await eventWriter.close() (line 1170), which immediately sets closed = true and flushes pending events. However, start() is still executing asynchronously — it awaits executeWorkflowCore, which may not respond to the abort signal instantaneously (e.g., if it is mid-way through an external AI API call).

Any events that start()'s execution writes to eventWriter after cancel() has set closed = true but before executeWorkflowCore sees the abort and stops will be silently dropped (the writeCore guard returns early when closed is true). These events — including block-completion events or error details — won't be persisted to Redis and will be lost from the audit log.

Since close() already awaits all inflight writes tracked in inflightWrites via Promise.allSettled, the issue is only with writes that are initiated after close() sets closed = true. The window is small but real: between timeoutController.abort() (line 1168) and the moment executeWorkflowCore honours the signal, the execution may still emit events.

One approach is to not call eventWriter.close() from cancel() at all, and instead let start()'s existing finally block handle it after the execution has fully unwound. The abort signal will cause executeWorkflowCore to return quickly, and the finally block will then flush and close correctly with no lost events.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@waleedlatif1
Copy link
Collaborator Author

@greptile

@waleedlatif1
Copy link
Collaborator Author

@cursor review

@waleedlatif1
Copy link
Collaborator Author

@cursor review

@waleedlatif1
Copy link
Collaborator Author

@greptile

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

@waleedlatif1 waleedlatif1 merged commit b42f80e into staging Feb 28, 2026
12 checks passed
@waleedlatif1 waleedlatif1 deleted the waleedlatif1/add-aws-monitoring branch February 28, 2026 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant