fix: apply stricter early routing for base64 media to prevent SSE dat…#1544
Open
MoonSangJin wants to merge 3 commits intolangfuse:mainfrom
Open
fix: apply stricter early routing for base64 media to prevent SSE dat…#1544MoonSangJin wants to merge 3 commits intolangfuse:mainfrom
MoonSangJin wants to merge 3 commits intolangfuse:mainfrom
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR addresses the issue discussed in langfuse/langfuse#5659 where LLM SSE streams were inadvertently intercepted by the media manager, resulting in silent data loss (empty
mediaobjects).Changes Made
langfuse/_task_manager/media_manager.py..endswith(";base64")to ensure that only valid Base64 Data URIs (following RFC 2397) are routed to theLangfuseMediaparser.data: [DONE]) from being swallowed, allowing them to safely pass through and be returned as original strings.⏳ Note to reviewers:I am opening this PR now with the core logic fix. I will add the corresponding test cases totests/test_media_manager.pywithin the next 24 hours to prevent any regressions! 🚀This aligns with the existing architecture while fully resolving the data loss bug with minimal diff.
Related Issue
Closes langfuse/langfuse#5659
Related Discussion: langfuse/langfuse#9826
Important
Fixes routing in
media_manager.pyto prevent incorrect processing of SSE streams as media by enforcing stricter Base64 URI checks.media_manager.pyto strictly check for Base64 Data URIs using.endswith(";base64").test_media_manager.pyto prevent regressions.This description was created by
for 9e0030a. You can customize this summary. It will automatically update as commits are pushed.
Disclaimer: Experimental PR review
Greptile Summary
Fixes a data loss bug where SSE stream messages (e.g.,
data: [DONE],data: {"choices": [...]}) were incorrectly matched by thedata.startswith("data:")check in_find_and_process_media, causing them to be routed toLangfuseMediaand silently converted to empty media objects instead of being returned as original strings.The fix adds two additional conditions to the early routing check:
"," in data— ensures the string contains a comma separator (required by RFC 2397 data URI format)data.split(",", 1)[0].endswith(";base64")— ensures the header portion ends with;base64These checks align with the validation already performed inside
LangfuseMedia._parse_base64_data_uri(), but critically prevent the creation of aLangfuseMediaobject in the first place — which is where the data loss occurred (the original string was replaced by the media object regardless of whether parsing succeeded).Confidence Score: 4/5
"," in dataandendswith(";base64")) align perfectly with RFC 2397 and the existing validation inLangfuseMedia._parse_base64_data_uri(). The only reason for not scoring 5 is the absence of tests — the author acknowledged this and committed to adding them within 24 hours.Important Files Changed
;base64suffix in the header, preventing SSE streams and JSON payloads starting withdata:from being incorrectly consumed as media.Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A["_process_data_recursively(data)"] --> B{"isinstance(data, str)?"} B -- No --> G{"isinstance(data, LangfuseMedia)?"} B -- Yes --> C{"data.startswith('data:')?"} C -- No --> Z["Return data unchanged"] C -- Yes --> D{"',' in data?"} D -- No --> Z D -- Yes --> E{"header.endswith(';base64')?"} E -- No --> Z E -- Yes --> F["Create LangfuseMedia\nand process upload"] G -- Yes --> H["Process existing media"] G -- No --> I{"Anthropic/Vertex dict?"} I -- Yes --> J["Convert to LangfuseMedia"] I -- No --> K{"list or dict?"} K -- Yes --> L["Recurse into children"] K -- No --> Z style D fill:#90EE90,stroke:#333 style E fill:#90EE90,stroke:#333Last reviewed commit: 1224146
(3/5) Reply to the agent's comments like "Can you suggest a fix for this @greptileai?" or ask follow-up questions!