Skip to content

fix: apply stricter early routing for base64 media to prevent SSE dat…#1544

Open
MoonSangJin wants to merge 3 commits intolangfuse:mainfrom
MoonSangJin:fix/strict-base64-routing
Open

fix: apply stricter early routing for base64 media to prevent SSE dat…#1544
MoonSangJin wants to merge 3 commits intolangfuse:mainfrom
MoonSangJin:fix/strict-base64-routing

Conversation

@MoonSangJin
Copy link

@MoonSangJin MoonSangJin commented Feb 26, 2026

Description

This PR addresses the issue discussed in langfuse/langfuse#5659 where LLM SSE streams were inadvertently intercepted by the media manager, resulting in silent data loss (empty media objects).

Changes Made

  • Updated the early routing condition in langfuse/_task_manager/media_manager.py.
  • Added a strict check using .endswith(";base64") to ensure that only valid Base64 Data URIs (following RFC 2397) are routed to the LangfuseMedia parser.
  • This prevents JSON payloads or SSE streams (like data: [DONE]) from being swallowed, allowing them to safely pass through and be returned as original strings.

⏳ Note to reviewers:
I am opening this PR now with the core logic fix. I will add the corresponding test cases to tests/test_media_manager.py within the next 24 hours to prevent any regressions! 🚀

✅ Update: The test cases for the routing logic have been fully added to tests/test_media_manager.py in the latest commit! Ready for review.

This aligns with the existing architecture while fully resolving the data loss bug with minimal diff.

Related Issue

Closes langfuse/langfuse#5659

Related Discussion: langfuse/langfuse#9826


Important

Fixes routing in media_manager.py to prevent incorrect processing of SSE streams as media by enforcing stricter Base64 URI checks.

  • Behavior:
    • Updates early routing condition in media_manager.py to strictly check for Base64 Data URIs using .endswith(";base64").
    • Prevents JSON payloads or SSE streams from being incorrectly processed as media, allowing them to pass through as original strings.
  • Misc:
    • No test cases added yet, but planned for test_media_manager.py to prevent regressions.

This description was created by Ellipsis for 9e0030a. You can customize this summary. It will automatically update as commits are pushed.

Disclaimer: Experimental PR review

Greptile Summary

Fixes a data loss bug where SSE stream messages (e.g., data: [DONE], data: {"choices": [...]}) were incorrectly matched by the data.startswith("data:") check in _find_and_process_media, causing them to be routed to LangfuseMedia and silently converted to empty media objects instead of being returned as original strings.

The fix adds two additional conditions to the early routing check:

  • "," in data — ensures the string contains a comma separator (required by RFC 2397 data URI format)
  • data.split(",", 1)[0].endswith(";base64") — ensures the header portion ends with ;base64

These checks align with the validation already performed inside LangfuseMedia._parse_base64_data_uri(), but critically prevent the creation of a LangfuseMedia object in the first place — which is where the data loss occurred (the original string was replaced by the media object regardless of whether parsing succeeded).

  • No new tests are included yet (author noted they will follow within 24 hours)
  • No import changes; custom instruction about top-level imports is not violated

Confidence Score: 4/5

  • This PR is safe to merge — it's a minimal, targeted fix that adds stricter validation to an existing condition without changing the happy path for valid base64 data URIs.
  • The fix is logically correct and well-scoped. The new conditions ("," in data and endswith(";base64")) align perfectly with RFC 2397 and the existing validation in LangfuseMedia._parse_base64_data_uri(). The only reason for not scoring 5 is the absence of tests — the author acknowledged this and committed to adding them within 24 hours.
  • No files require special attention. The single changed file has a minimal, correct diff.

Important Files Changed

Filename Overview
langfuse/_task_manager/media_manager.py Adds stricter early routing for base64 data URI detection by requiring a comma separator and ;base64 suffix in the header, preventing SSE streams and JSON payloads starting with data: from being incorrectly consumed as media.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["_process_data_recursively(data)"] --> B{"isinstance(data, str)?"}
    B -- No --> G{"isinstance(data, LangfuseMedia)?"}
    B -- Yes --> C{"data.startswith('data:')?"}
    C -- No --> Z["Return data unchanged"]
    C -- Yes --> D{"',' in data?"}
    D -- No --> Z
    D -- Yes --> E{"header.endswith(';base64')?"}
    E -- No --> Z
    E -- Yes --> F["Create LangfuseMedia\nand process upload"]
    G -- Yes --> H["Process existing media"]
    G -- No --> I{"Anthropic/Vertex dict?"}
    I -- Yes --> J["Convert to LangfuseMedia"]
    I -- No --> K{"list or dict?"}
    K -- Yes --> L["Recurse into children"]
    K -- No --> Z

    style D fill:#90EE90,stroke:#333
    style E fill:#90EE90,stroke:#333
Loading

Last reviewed commit: 1224146

(3/5) Reply to the agent's comments like "Can you suggest a fix for this @greptileai?" or ask follow-up questions!

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: python: "Data is not base64 encoded" on server sent events

1 participant