🌐 Languages: English | Deutsch
AI Assistant with Multi-LLM Support, Web Research & Voice Interface
AIfred Intelligence is an advanced AI assistant with automatic web research, multi-model support, and history compression for unlimited conversations.
For version history and recent changes, see CHANGELOG.md.
Upgrading from v2.49 or earlier? Run `rm -f data/agents.json data/blocked_domains.txt` before `git pull` -- these files are now git-tracked with a new format. See CHANGELOG v2.50.0 for details.
📺 View Example Showcases - Exported chats (via Share Chat button) showcasing Multi-Agent debates, Chemistry, Math, Coding, and Web Research.
- Multi-Agent Debate System: AIfred + Sokrates + Salomo + Vision -- configurable agents with personality toggles
- Multi-Backend Support: llama.cpp via llama-swap (GGUF), Ollama (GGUF), vLLM (AWQ), TabbyAPI (EXL2), Cloud APIs (Qwen, DeepSeek, Claude)
- Automatic Model Lifecycle: Zero-config model management -- new models auto-discovered from Ollama/HuggingFace on service start, removed models auto-cleaned from config
- Vision/OCR Support: Image analysis with multimodal LLMs (DeepSeek-OCR, Qwen3-VL, Ministral-3), VL Follow-Up for image-related follow-up questions
- Image Crop Tool: Interactive crop before OCR/analysis (8-point handles, 4K auto-resize)
- 2-Model Architecture: Specialized Vision-LLM for OCR/analysis, Main-LLM for text tasks
- Thinking Mode: Chain-of-Thought reasoning for complex tasks (Qwen3, NemoTron, QwQ - llama.cpp, Ollama, vLLM)
- Harmony-Template Support: GPT-OSS-120B with official Harmony format (`<|channel|>analysis<|message|>`)
- Automatic Web Research: AI decides autonomously when research is needed
- History Compression: Intelligent compression at 70% context utilization
- Automatic Context Calibration: VRAM-aware context sizing per backend - Ollama (Binary Search + RoPE scaling 1.0x/1.5x/2.0x, hybrid CPU offload), llama.cpp (3-phase: GPU-only Binary Search → Speed variant with tensor-split optimization for multi-GPU → Hybrid NGL fallback)
- Voice Interface: Configurable STT (Whisper) and TTS (Edge TTS, XTTS v2 Voice Cloning, MOSS-TTS 1.7B Voice Cloning, DashScope Qwen3-TTS Cloud Streaming with Voice Cloning, Piper, espeak) with multiple voices, pitch control, smart filtering (code blocks, tables, LaTeX formulas excluded from speech), per-agent voice settings, gapless realtime audio playback (double-buffered HTML5 audio, seamless playback during LLM inference)
- Vector Cache: ChromaDB with multilingual Ollama embeddings (nomic-embed-text-v2-moe, CPU-only)
- Sampling Parameters Table: Per-agent control of Temperature, Top-K, Top-P, Min-P, Repeat-Penalty (Auto/Manual mode) -- sampling params reset to llama-swap YAML defaults on restart, temperature persisted in settings.json
- Per-Backend Settings: Each backend remembers its preferred models (including Vision-LLM)
- User Authentication: Username + password login with whitelist-based registration, admin CLI for user management
- Session Persistence: Chat history tied to user accounts, accessible from any device after login
- Session Management: Chat list sidebar with LLM-generated titles, switch between sessions, delete old chats
- Share Chat: Export conversation as portable HTML file in new browser tab (KaTeX fonts embedded inline, embedded TTS audio output, works offline)
- HTML Preview: AI-generated HTML code opens directly in browser (new tab)
- LaTeX & Chemistry: KaTeX for math formulas, mhchem extension for chemistry (`\ce{H2O}`, reactions)
- 🚀 Massive Performance Improvements: Direct-IO reduces model loading from 60-90s to just 2 seconds (~45x faster!) - see Model Parameter Docs for all 200B+ optimizations (KV-Quant, Batch-Sizes, VRAM optimization)
AIfred supports various discussion modes with Sokrates (critic) and Salomo (judge):
| Mode | Flow | Who decides? |
|---|---|---|
| Standard | AIfred answers | - |
| Critical Review | AIfred → Sokrates (+ Pro/Contra) → STOP | User |
| Auto-Consensus | AIfred → Sokrates → Salomo (X rounds) | Salomo |
| Tribunal | AIfred → Sokrates (X rounds) → Salomo | Salomo (Verdict) |
Agents:
- 🎩 AIfred - Butler & Scholar - answers questions (British butler style with subtle elegance)
- 🏛️ Sokrates - Critical Philosopher - questions & challenges using the Socratic method
- 📜 Salomo - Wise Judge - synthesizes arguments and makes final decisions
- 📷 Vision - Image Analyst - OCR and visual Q&A (inherits AIfred's personality)
Customizable Personalities:
- All agent prompts are plain text files in `prompts/de/` and `prompts/en/`
- Agent configuration in `data/agents.json` -- prompt paths, toggles, roles
- Personality can be toggled on/off in UI settings (keeps identity, removes style)
- 3-layer prompt system: Identity (who) + Personality (how, optional) + Task (what)
- Easily create your own agents or modify existing personalities
- Multilingual: Agents respond in the user's language (German prompts for German, English prompts for all other languages)
Direct Agent Addressing (NEW in v2.10):
- Address Sokrates directly: "Sokrates, what do you think about...?" → Sokrates answers with Socratic method
- Address AIfred directly: "AIfred, explain..." → AIfred answers without Sokrates analysis
- Supports STT transcription variants: "Alfred", "Eifred", "AI Fred"
- Works at sentence end too: "Great explanation. Sokrates." / "Well done. Alfred!"
Intelligent Context Handling (v2.10.2):
- Multi-Agent messages use `role: system` with `[MULTI-AGENT CONTEXT]` prefix
- Speaker labels `[SOKRATES]:` and `[AIFRED]:` preserved for LLM context
- Prevents LLM from confusing agent exchanges with its own responses
- All prompts automatically receive current date/time for temporal queries
Perspective System (v2.10.3):
- Each agent sees the conversation from their own perspective
- Sokrates sees AIfred's answers as `[AIFRED]:` (user role), his own as `assistant`
- AIfred sees Sokrates' critiques as `[SOKRATES]:` (user role), his own as `assistant`
- Prevents identity confusion between agents during multi-round debates
```
┌─────────────────────────────────────────┐
│ llm_history (stored)                    │
│                                         │
│ [AIFRED]: "Answer 1"                    │
│ [SOKRATES]: "Critique"                  │
│ [AIFRED]: "Answer 2"                    │
└─────────────────────────────────────────┘
                  │
                  ▼
       ┌──────────┼──────────┐
       │          │          │
       ▼          ▼          ▼
 ┌─────────┐ ┌──────────┐ ┌─────────┐
 │ AIfred  │ │ Sokrates │ │ Salomo  │
 │ calls   │ │ calls    │ │ calls   │
 └────┬────┘ └────┬─────┘ └────┬────┘
      │           │            │
      ▼           ▼            ▼
 ┌─────────┐ ┌──────────┐ ┌─────────┐
 │assistant│ │ user     │ │ user    │
 │"Answ 1" │ │[AIFRED]: │ │[AIFRED]:│
 │ user    │ │assistant │ │ user    │
 │[SOKR].. │ │"Critique"│ │[SOKR].. │
 │assistant│ │ user     │ │ user    │
 │"Answ 2" │ │[AIFRED]: │ │[AIFRED]:│
 └─────────┘ └──────────┘ └─────────┘
```
One source, three views - depending on who is speaking.
Own messages = assistant (no label), others = user (with label).
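The per-agent view construction can be sketched in a few lines. Function and field names here are illustrative, not AIfred's actual API:

```python
# Perspective mapping: each agent sees its own messages as plain `assistant`
# turns and everyone else's as labeled `user` turns.
def build_view(llm_history: list[dict], viewer: str) -> list[dict]:
    view = []
    for msg in llm_history:
        if msg["agent"] == viewer:
            # Own message: assistant role, label stripped
            view.append({"role": "assistant", "content": msg["text"]})
        else:
            # Other agent: user role, speaker label preserved
            label = f"[{msg['agent'].upper()}]"
            view.append({"role": "user", "content": f"{label}: {msg['text']}"})
    return view

history = [
    {"agent": "aifred", "text": "Answer 1"},
    {"agent": "sokrates", "text": "Critique"},
    {"agent": "aifred", "text": "Answer 2"},
]
```

Calling `build_view(history, "aifred")` yields the left column of the diagram; `"sokrates"` and `"salomo"` yield the middle and right columns.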
Structured Critic Prompts (v2.10.3):
- Round number placeholder `{round_num}` -- Sokrates knows which round it is
- Maximum 1-2 critique points per round
- Sokrates only critiques - never decides consensus (that's Salomo's job)
Temperature Control (v2.10.4):
- Auto mode: Intent-Detection determines base temperature (FACTUAL=0.2, MIXED=0.5, CREATIVE=1.1)
- Manual mode: Per-agent temperature in the sampling table
- Configurable Sokrates offset in Auto mode (default +0.2, capped at 1.0)
- All temperature settings in "LLM Parameters (Advanced)" collapsible
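The Auto-mode temperature selection described above can be sketched as follows. The base values come from the list above; the function name and the exact capping behavior for CREATIVE (base 1.1 is already above the 1.0 cap) are assumptions:

```python
# Base temperatures per detected intent (from the description above)
BASE_TEMPERATURE = {"FACTUAL": 0.2, "MIXED": 0.5, "CREATIVE": 1.1}

def agent_temperature(intent: str, agent: str, sokrates_offset: float = 0.2) -> float:
    """Auto-mode temperature; Sokrates gets a configurable offset, capped at 1.0."""
    temp = BASE_TEMPERATURE[intent]
    if agent == "sokrates":
        # Offset raises Sokrates' temperature; the offset result is capped at 1.0
        temp = min(temp + sokrates_offset, 1.0)
    return round(temp, 2)
```

In Manual mode this logic is bypassed entirely and the per-agent value from the sampling table is used.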
Sampling Parameter Persistence:
- Temperature: Persisted in `settings.json` (per-agent, survives restart)
- Top-K, Top-P, Min-P, Repeat-Penalty: NOT persisted -- reset to model-specific defaults from llama-swap YAML config on every restart
- Model change: Resets ALL sampling parameters (including temperature) to YAML defaults
- Reset button (↺): Resets ALL sampling parameters (including temperature) to YAML defaults
Trialog Workflow (Auto-Consensus with Salomo):
```
┌─────────────┐     ┌─────────────────┐     ┌─────────────────────┐
│ User        │────▶│ 🎩 AIfred       │────▶│ 🏛️ Sokrates         │
│ Query       │     │ + [LGTM/WEITER] │     │ + [LGTM/WEITER]     │
└─────────────┘     └─────────────────┘     └──────────┬──────────┘
                                                       │
                          ┌────────────────────────────┘
                          ▼
               ┌─────────────────────┐
               │ 📜 Salomo           │
               │ + [LGTM/WEITER]     │
               └──────────┬──────────┘
                          │
          ┌───────────────┴───────────────┐
          ▼                               ▼
 ┌────────────────┐              ┌─────────────────┐
 │ 2/3 or 3/3     │              │ Not enough votes│
 │ = Consensus!   │              │ = Next Round    │
 └────────────────┘              └─────────────────┘
```
Consensus Types (configurable in settings):
- Majority (2/3): Two of three agents must vote `[LGTM]`
- Unanimous (3/3): All three agents must vote `[LGTM]`
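A vote-counting sketch for the two consensus types. Each agent ends its turn with `[LGTM]` (accept) or `[WEITER]` (continue); the function name and dict shape are illustrative:

```python
# Consensus check: majority = 2/3 votes, unanimous = 3/3 votes.
def has_consensus(votes: dict[str, str], mode: str = "majority") -> bool:
    lgtm = sum(1 for v in votes.values() if v == "LGTM")
    required = 2 if mode == "majority" else 3  # unanimous = 3/3
    return lgtm >= required

round_votes = {"aifred": "LGTM", "sokrates": "WEITER", "salomo": "LGTM"}
```

With `round_votes` above, majority consensus is reached while unanimous consensus would trigger another round.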
Tribunal Workflow:
```
┌─────────────┐     ┌─────────────────────────────────────┐
│ User        │────▶│ 🎩 AIfred  ⇄  🏛️ Sokrates           │
│ Query       │     │ Debate for X Rounds                 │
└─────────────┘     └──────────────────┬──────────────────┘
                                       │
                                       ▼
                    ┌─────────────────────────────────────┐
                    │ 📜 Salomo - Final Verdict           │
                    │ Weighs both sides, decides winner   │
                    └─────────────────────────────────────┘
```
Message Display Format:
Each message is displayed individually with its emoji and mode label:
| Role | Agent | Display Format | Example |
|---|---|---|---|
| User | - | 👤 {Username} (right-aligned) | 👤 User: "What is Python?" |
| Assistant | aifred | 🎩 AIfred [{Mode} R{N}] (left-aligned) | 🎩 AIfred [Auto-Consensus: Refinement R2] |
| Assistant | sokrates | 🏛️ Sokrates [{Mode} R{N}] (left-aligned) | 🏛️ Sokrates [Tribunal: Critique R1] |
| Assistant | salomo | 📜 Salomo [{Mode} R{N}] (left-aligned) | 📜 Salomo [Tribunal: Verdict R3] |
| System | - | 📝 Summary (collapsible inline) | 📝 Summary #1 (5 Messages) |
Mode Labels:
- Standard responses: No label (clean display)
- Multi-Agent modes: `[{Mode}: {Action} R{N}]` format
  - Mode: `Auto-Consensus`, `Tribunal`, `Critical Review`
  - Action: `Refinement`, `Critique`, `Synthesis`, `Verdict`
  - Round: `R1`, `R2`, `R3`, etc.
Examples:
- Standard: `🎩 AIfred` (no label)
- Auto-Consensus R1: `🎩 AIfred [Auto-Consensus: Refinement R1]`
- Tribunal R2: `🏛️ Sokrates [Tribunal: Critique R2]`
- Final Verdict: `📜 Salomo [Tribunal: Verdict R3]`
Prompt Files per Mode:
| Mode | Agent | Prompt File | Mode Label | Display Example |
|---|---|---|---|---|
| Standard | AIfred | aifred/system_rag or system_minimal | - | 🎩 AIfred |
| Direct AIfred | AIfred | aifred/direct | Direct Response | 🎩 AIfred [Direct Response] |
| Direct Sokrates | Sokrates | sokrates/direct | Direct Response | 🏛️ Sokrates [Direct Response] |
| Critical Review | Sokrates | sokrates/critic | Critical Review | 🏛️ Sokrates [Critical Review] |
| Critical Review | AIfred | aifred/system_minimal | Critical Review: Refinement | 🎩 AIfred [Critical Review: Refinement] |
| Auto-Consensus R{N} | Sokrates | sokrates/critic | Auto-Consensus: Critique R{N} | 🏛️ Sokrates [Auto-Consensus: Critique R2] |
| Auto-Consensus R{N} | AIfred | aifred/system_minimal | Auto-Consensus: Refinement R{N} | 🎩 AIfred [Auto-Consensus: Refinement R2] |
| Auto-Consensus R{N} | Salomo | salomo/mediator | Auto-Consensus: Synthesis R{N} | 📜 Salomo [Auto-Consensus: Synthesis R2] |
| Tribunal R{N} | Sokrates | sokrates/tribunal | Tribunal: Attack R{N} | 🏛️ Sokrates [Tribunal: Attack R1] |
| Tribunal R{N} | AIfred | aifred/defense | Tribunal: Defense R{N} | 🎩 AIfred [Tribunal: Defense R1] |
| Tribunal Final | Salomo | salomo/judge | Tribunal: Verdict R{N} | 📜 Salomo [Tribunal: Verdict R3] |
Note: All prompts are in prompts/de/ (German) and prompts/en/ (English)
UI Settings:
- Sokrates-LLM and Salomo-LLM separately selectable (can be different models)
- Max debate rounds (1-10, default: 3)
- Discussion mode in Settings panel
- 💡 Help icon opens modal with all modes overview
Thinking Support:
- All agents (AIfred, Sokrates, Salomo) support Thinking Mode
- `<think>` blocks formatted as collapsibles
- Reflex Framework: React frontend generated from Python
- WebSocket Streaming: Real-time updates without polling
- Adaptive Temperature: AI selects temperature based on question type
- Token Management: Dynamic context window calculation
- VRAM-Aware Context: Automatic context sizing based on available GPU memory
- Debug Console: Comprehensive logging and monitoring
- ChromaDB Server Mode: Thread-safe vector DB via Docker (0.0 distance for exact matches)
- GPU Detection: Automatic detection and warnings for incompatible backend-GPU combinations (docs/GPU_COMPATIBILITY.md)
- Context Calibration: Intelligent per-model calibration for Ollama and llama.cpp
  - Ollama: Binary search with automatic VRAM/Hybrid mode detection (512 token precision)
    - Hybrid mode for CPU+GPU offload (MoE vs Dense detection, 3 GB RAM reserve)
    - Auto-Hybrid threshold: VRAM-only < 16k tokens → switch to Hybrid
  - llama.cpp (3-phase calibration for multi-GPU setups):
    - Phase 1 (GPU-only): Binary search on `-c` with `ngl=99`, stops llama-swap, tests on temp port
      - Small model shortcut: models with `native_context ≤ 8192` are tested directly (no binary search)
      - flash-attn auto-detection: startup failure → automatic retry without `--flash-attn`, updates llama-swap YAML on success
    - Phase 2 (Speed variant): Probe + binary search on `--tensor-split N:1` at 32K context -- probes from original split+2 whether a more aggressive GPU split is possible (e.g. 11:1 on dual-72GB GPUs). No headroom = 1-2 tests, headroom found = binary search upward to maximum. Creates a separate `model-speed` entry in llama-swap YAML config
    - Phase 3 (Hybrid fallback): If Phase 1 < 16K → NGL reduction to free VRAM for KV-cache
    - Startup errors (unknown architecture, wrong CUDA version) are logged and never written as false calibration data
  - Results cached in unified `data/model_vram_cache.json`
- llama-swap Autoscan: Automatic model discovery on service start (`scripts/llama-swap-autoscan.py`) -- zero manual YAML editing required
  - Scans Ollama manifests → creates descriptive symlinks in `~/models/` (e.g., `sha256-6335adf...` → `Qwen3-14B-Q8_0.gguf`)
  - Scans HuggingFace cache (`~/.cache/huggingface/hub/`) → creates symlinks for downloaded GGUFs
  - VL models (with matching `mmproj-*.gguf`) automatically get the `--mmproj` argument
  - Compatibility test: each new model is briefly started with llama-server → unsupported architectures (e.g. `deepseekocr`) are detected and excluded before being added to the config
  - Skip list (`~/.config/llama-swap/autoscan-skip.json`): incompatible models are remembered, no re-test on every restart. Delete the entry to re-test after a llama.cpp update
  - Detects new GGUFs and adds llama-swap config entries with optimal defaults (`-ngl 99`, `--flash-attn on`, `-ctk q8_0`, etc.)
  - Automatically maintains `groups.main.members` in the YAML → all models share VRAM exclusivity without manual editing
  - Creates preliminary VRAM cache entries (calibration via UI adds `vram_used_mb` measured while the model is loaded)
  - Creates `config.yaml` from scratch if not present → no manual bootstrap required
  - Runs as `ExecStartPre` in the systemd service → `ollama pull model` or `hf download` is all it takes to add a model
- Ctx/Speed Switch: Per-agent toggle between two pre-calibrated variants (Ctx = max context, ⚡ Speed = 32K + aggressive GPU split)
- RoPE 2x Extended Context: Optional extended calibration up to 2x native context limit
- Parallel Web Search: 2-3 optimized queries distributed in parallel across APIs (Tavily, Brave, SearXNG), automatic URL deduplication, optional self-hosted SearXNG
- Parallel Scraping: ThreadPoolExecutor scrapes 3-7 URLs simultaneously, first successful results are used
- Failed Sources Display: Shows unavailable URLs with error reasons (Cloudflare, 404, Timeout) - persisted in Vector Cache for cache hits
- PDF Support: Direct extraction from PDF documents (AWMF guidelines, PubMed PDFs) via PyMuPDF with browser-like User-Agent
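For illustration, an autoscan-generated llama-swap entry might look roughly like the following. The structure follows llama-swap's config format, but the model name, path, and exact flags here are assumptions, not output copied from the script:

```yaml
models:
  "Qwen3-14B-Q8_0":
    # Hypothetical entry; autoscan derives the path from the Ollama/HF symlink
    cmd: >
      llama-server --port ${PORT}
      -m ~/models/Qwen3-14B-Q8_0.gguf
      -ngl 99 --flash-attn on -ctk q8_0

groups:
  main:
    swap: true        # only one member loaded at a time
    exclusive: true   # members get the GPU exclusively
    members:
      - "Qwen3-14B-Q8_0"
```

The `groups.main.members` list is the part the autoscan script maintains automatically so that all models share VRAM exclusivity.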
AIfred supports 6 TTS engines with different trade-offs between quality, latency, and resource usage. Each engine was chosen for a specific use case after extensive experimentation.
| Engine | Type | Streaming | Quality | Latency | Resources |
|---|---|---|---|---|---|
| XTTS v2 | Local Docker | Sentence-level | High (voice cloning) | ~1-2s/sentence | ~2 GB VRAM |
| MOSS-TTS 1.7B | Local Docker | None (batch after bubble) | Excellent (best open-source) | ~18-22s/sentence | ~11.5 GB VRAM |
| DashScope Qwen3-TTS | Cloud API | Sentence-level | High (voice cloning) | ~1-2s/sentence | API key only |
| Piper TTS | Local | Sentence-level | Medium | <100ms | CPU only |
| eSpeak | Local | Sentence-level | Low (robotic) | <50ms | CPU only |
| Edge TTS | Cloud | Sentence-level | Good | ~200ms | Internet only |
Why multiple engines?
The search for the perfect TTS experience led through several iterations:
- Edge TTS was the first engine -- free, fast, decent quality, but limited voices and no voice cloning.
- XTTS v2 added high-quality voice cloning with multilingual support. Sentence-level streaming works well: while the LLM generates the next sentence, XTTS synthesizes the current one. However, it requires a Docker container and ~2 GB VRAM.
- MOSS-TTS 1.7B delivers the best speech quality of all open-source models (SIM 73-79%), but at a cost: ~18-22 seconds per sentence makes it unsuitable for streaming. Audio is generated as a batch after the complete response, which is acceptable for short answers but frustrating for longer ones.
- DashScope Qwen3-TTS adds cloud-based voice cloning via Alibaba Cloud's API. By default it uses sentence-level streaming (same as XTTS) for better intonation. A realtime WebSocket mode (word-level chunks, ~200ms first audio) is also implemented but disabled by default -- it trades slightly worse prosody for faster first audio. To re-enable it, uncomment the WebSocket block in `state.py:_init_streaming_tts()` (see code comment there).
- Piper TTS and eSpeak serve as lightweight offline alternatives that work without Docker, GPU, or internet connection.
Playback Architecture:
- Visible HTML5 `<audio>` widget with blob-URL prefetching (next 2 chunks pre-fetched into memory)
- `preservesPitch: true` for speed adjustments without chipmunk effect
- Per-agent voice/pitch/speed settings (AIfred, Sokrates, Salomo can each have distinct voices)
- SSE-based audio streaming from backend to browser (persistent connection, 15s keepalive)
AIfred supports cloud LLM providers via OpenAI-compatible APIs:
| Provider | Models | API Key Variable |
|---|---|---|
| Qwen (DashScope) | qwen-plus, qwen-turbo, qwen-max | DASHSCOPE_API_KEY |
| DeepSeek | deepseek-chat, deepseek-reasoner | DEEPSEEK_API_KEY |
| Claude (Anthropic) | claude-3.5-sonnet, claude-3-opus | ANTHROPIC_API_KEY |
| Kimi (Moonshot) | moonshot-v1-8k, moonshot-v1-32k | MOONSHOT_API_KEY |
Features:
- Dynamic model fetching (models loaded from provider's `/models` endpoint)
- Token usage tracking (prompt + completion tokens displayed in debug console)
- Per-provider model memory (each provider remembers its last used model)
- Vision model filtering (excludes `-vl` variants from main LLM dropdown)
- Streaming support with real-time output
Note: Cloud APIs don't require local GPU resources - ideal for:
- Testing larger models without hardware investment
- Mobile/laptop usage without dedicated GPU
- Comparing cloud vs local model quality
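Since all providers speak the OpenAI-compatible protocol, a single request builder covers them. This is a sketch only: the base URLs and registry below are assumptions for illustration, not values from AIfred's source; the environment variable names match the table above.

```python
import os

# Hypothetical provider registry; base URLs are illustrative assumptions.
PROVIDERS = {
    "qwen": {"env": "DASHSCOPE_API_KEY",
             "base": "https://dashscope.aliyuncs.com/compatible-mode/v1"},
    "deepseek": {"env": "DEEPSEEK_API_KEY", "base": "https://api.deepseek.com/v1"},
    "kimi": {"env": "MOONSHOT_API_KEY", "base": "https://api.moonshot.cn/v1"},
}

def build_chat_request(provider: str, model: str, prompt: str) -> dict:
    """Assemble an OpenAI-compatible /chat/completions request (no network I/O)."""
    cfg = PROVIDERS[provider]
    return {
        "url": f"{cfg['base']}/chat/completions",
        "headers": {"Authorization": f"Bearer {os.environ.get(cfg['env'], '')}"},
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,  # AIfred streams tokens in real time
        },
    }
```

Swapping providers then only means changing the registry key and model name; the rest of the pipeline stays identical.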
- Automatik-LLM (Intent Detection, Query Optimization, Addressee Detection): Medium instruct models recommended
  - Recommended: `qwen3:14b` (Q4 or Q8 quantization)
  - Better semantic understanding for complex addressee detection ("What does Alfred think about Salomo's answer?")
  - Small 4B models may struggle with nuanced sentence semantics
  - Thinking mode is automatically disabled for Automatik tasks (fast decisions)
  - "(same as AIfred-LLM)" option available - uses the same model as AIfred without extra VRAM
- Main LLM: Use larger models (14B+, ideally 30B+) for better context understanding and prompt following
- Both Instruct and Thinking models work well
- Enable "Thinking Mode" for chain-of-thought reasoning on complex tasks
- Language Note: Small models (4B-14B) may respond in English when RAG context contains predominantly English web content, even with German prompts. Models 30B+ reliably follow language instructions regardless of context language.
AIfred offers 4 different research modes, each using different strategies depending on requirements. Here's the detailed workflow for each mode:
| Mode | Min LLM Calls | Max LLM Calls | Typical Duration |
|---|---|---|---|
| Own Knowledge | 1 | 1 | 5-30s |
| Automatik (Cache Hit) | 0 | 0 | <1s |
| Automatik (Direct Answer) | 2 | 3 | 5-35s |
| Automatik (Web Research) | 4 | 5 | 15-60s |
| Quick Web Search | 3 | 4 | 10-40s |
| Deep Web Search | 3 | 4 | 15-60s |
Shared first step for all Research modes:
```
Intent + Addressee Detection
└─ LLM Call (Automatik-LLM) - combined in one call
   ├─ Prompt: intent_detection
   ├─ Response: "FAKTISCH|sokrates" | "KREATIV|" | "GEMISCHT|aifred"
   ├─ Temperature usage:
   │  ├─ Auto-Mode: FAKTISCH=0.2, GEMISCHT=0.5, KREATIV=1.0
   │  └─ Manual-Mode: Intent ignored, manual value used
   └─ Addressee: Direct agent addressing (sokrates/aifred/salomo)
```
When an agent is directly addressed, that agent is activated immediately, regardless of the selected Research mode or temperature setting.
Simplest mode: Direct LLM call without web research or AI decision.
Workflow:
```
1. Message Building
   ├─ Build from chat history
   └─ Inject system_minimal prompt (with timestamp)
2. Model Preloading (Ollama only)
   ├─ backend.preload_model() - measures actual load time
   └─ vLLM/TabbyAPI: Skip (already in VRAM)
3. Token Management
   ├─ estimate_tokens(messages, model_name)
   └─ calculate_dynamic_num_ctx()
4. LLM Call - Main Response
   ├─ Model: Main-LLM (e.g., Qwen2.5-32B)
   ├─ Temperature: Manual (user setting)
   ├─ Streaming: Yes (real-time updates)
   └─ TTFT + Tokens/s measurement
5. Format & Save
   ├─ format_thinking_process() for <think> tags
   └─ Update chat history
6. History Compression (PRE-MESSAGE check - BEFORE every LLM call)
   ├─ Trigger: 70% utilization of smallest context window
   │  └─ Multi-Agent: min_ctx of all agents is used
   ├─ Dual History: chat_history (UI) + llm_history (LLM, FIFO)
   └─ Summaries appear inline in chat where compression occurred
```
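The PRE-MESSAGE compression trigger reduces to a simple threshold check. Function and parameter names below are illustrative, not AIfred's actual API:

```python
# Before every LLM call, compare the history token count against 70% of
# the smallest context window among the active agents.
def needs_compression(history_tokens: int, agent_ctx: dict[str, int],
                      threshold: float = 0.70) -> bool:
    min_ctx = min(agent_ctx.values())  # multi-agent: smallest window wins
    return history_tokens > threshold * min_ctx
```

Using the smallest window guarantees that a compressed history still fits every agent in a multi-agent debate, not just the one currently speaking.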
LLM Calls: 1 Main-LLM + optional 1 Compression-LLM (if >70% context)
Async Tasks: None
Code: aifred/state.py Lines 974-1117
Most intelligent mode: AI decides autonomously whether web research is needed.
```
1. Query ChromaDB for similar questions
   ├─ Distance < 0.5: HIGH Confidence → Cache Hit
   └─ Distance ≥ 0.5: CACHE_MISS → Continue
2. IF CACHE HIT:
   ├─ Answer directly from cache
   └─ RETURN (0 LLM Calls!)
```
```
1. Query cache for RAG candidates (distance 0.5-1.2)
2. FOR EACH candidate:
   ├─ LLM Relevance Check (Automatik-LLM)
   │  ├─ Prompt: rag_relevance_check
   │  └─ Options: temp=0.1, num_ctx=AUTOMATIK_LLM_NUM_CTX
   └─ Keep if relevant
3. Build formatted context from relevant entries
```
```
1. Check for explicit research keywords:
   └─ "search", "google", "research on the internet", etc.
2. IF keyword found:
   ├─ Trigger fresh web research (mode='deep' → 7 URLs)
   └─ BYPASS Automatik decision
```
```
1. LLM Call - Research Decision + Query Generation (combined)
   ├─ Model: Automatik-LLM (e.g., Qwen3:4B)
   ├─ Prompt: research_decision.txt
   │  ├─ Contains: Current date (for time-related queries)
   │  ├─ Vision context if images attached
   │  └─ Structured JSON output
   ├─ Messages: ❌ NO history (focused, unbiased decision)
   ├─ Options:
   │  ├─ temperature: 0.2 (consistent decisions)
   │  ├─ num_ctx: 12288 (AUTOMATIK_LLM_NUM_CTX) - only if Automatik ≠ AIfred model
   │  ├─ num_predict: 256
   │  └─ enable_thinking: False (fast)
   └─ Response: {"web": true, "queries": ["EN query", "DE query 1", "DE query 2"]}
                OR {"web": false}
2. Query Rules (if web=true):
   ├─ Query 1: ALWAYS in English (international sources)
   ├─ Query 2-3: In the language of the question
   └─ Each query: 4-8 keywords
3. Parse decision:
   ├─ IF web=true: → Web Research with pre-generated queries
   └─ IF web=false: → Direct LLM Answer (Phase 5)
```
Why no history for Decision-Making?
- Prevents bias from previous conversation context
- Decision based purely on current question + Vision data
- Ensures consistent, objective web research triggering
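Parsing the decision JSON defensively matters because small Automatik models occasionally emit malformed output. A sketch (the parser and its fallback behavior are illustrative; the JSON shape matches the workflow above):

```python
import json

def parse_research_decision(raw: str) -> tuple[bool, list[str]]:
    """Return (do_web_research, queries); malformed output falls back to no research."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return False, []  # unparsable -> safest default: answer directly
    if decision.get("web"):
        return True, decision.get("queries", [])
    return False, []
```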
```
1. Model Preloading (Ollama only)
2. Build Messages
   ├─ From chat history
   ├─ Inject system_minimal prompt
   └─ Optional: Inject RAG context (if found in Phase 2)
3. LLM Call - Main Response
   ├─ Model: Main-LLM
   ├─ Temperature: From Pre-Processing or manual
   ├─ Streaming: Yes
   └─ TTFT + Tokens/s measurement
4. Format & Update History
   └─ Metadata: "Cache+LLM (RAG)" or "LLM"
5. History Compression Check (same as Own Knowledge mode)
   └─ Automatic compression at >70% context utilization
```
LLM Calls:
- Cache Hit: 0 + optional 1 Compression
- RAG Context: 2-6 + optional 1 Compression
- Web Research: 4-5 + optional 1 Compression
- Direct Answer: 2-3 + optional 1 Compression
| LLM Call | Model | Chat History | Vision JSON | Temperature |
|---|---|---|---|---|
| Decision-Making | Automatik | ❌ No | ✅ In prompt | 0.2 |
| Query-Optimization | Automatik | ✅ Last 3 turns | ✅ In prompt | 0.3 |
| RAG-Relevance | Automatik | Indirect | ❌ No | 0.1 |
| Intent-Detection | Automatik | ❌ No | ❌ No | Internal |
| Main Response | Main-LLM | ✅ Full history | ✅ In context | Auto/Manual |
Design Rationale:
- Decision-Making without history: Unbiased decision based purely on current query
- Query-Optimization with history: Context-aware search for follow-up questions
- Main-LLM with full history: Complete conversation context for coherent responses
Code: aifred/lib/conversation_handler.py
Fastest web research mode: Scrapes top 3 URLs in parallel, optimized for speed.
```
1. Check session-based cache
   ├─ IF cache hit: Use cached sources → Skip to Phase 4
   └─ IF miss: Continue to Phase 2

1. LLM Call - Query Optimization
   ├─ Model: Automatik-LLM
   ├─ Prompt: query_optimization (+ Vision JSON if present)
   ├─ Messages: ✅ Last 3 history turns (for follow-up context)
   ├─ Options:
   │  ├─ temperature: 0.3 (balanced for keywords)
   │  ├─ num_ctx: min(8192, automatik_limit)
   │  ├─ num_predict: 128 (keyword extraction)
   │  └─ enable_thinking: False
   ├─ Post-processing:
   │  ├─ Extract <think> tags (reasoning)
   │  ├─ Clean query (remove quotes)
   │  └─ Add temporal context (current year)
   └─ Output: optimized_query, query_reasoning
2. Web Search (Multi-API with Fallback)
   ├─ Try: Brave Search API
   ├─ Fallback: Tavily Search API
   ├─ Fallback: SearXNG (local instance)
   ├─ Each API returns up to 10 URLs
   └─ Deduplication across APIs
```
Why history for Query-Optimization?
- Enables context-aware follow-up queries (e.g., "Tell me more about that")
- Limited to 3 turns to keep prompt focused
- Vision JSON injected for image-based searches
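The multi-API fallback with cross-API deduplication can be sketched as follows. The search callables are stand-ins for the real Brave/Tavily/SearXNG clients:

```python
# Try each search API in order; keep the first one that delivers results,
# deduplicating URLs along the way.
def search_with_fallback(query: str, apis: list) -> list[str]:
    seen: set[str] = set()
    urls: list[str] = []
    for api in apis:
        try:
            results = api(query)  # each API returns up to 10 URLs
        except Exception:
            continue  # API down or rate-limited -> try the next one
        for url in results:
            if url not in seen:  # dedupe across APIs
                seen.add(url)
                urls.append(url)
        if urls:
            break  # first API that delivered results wins
    return urls
```

Because deduplication preserves insertion order, the downstream URL-ranking step sees a stable, duplicate-free candidate list.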
```
1. Non-Scrapable Domain Filter (BEFORE URL Ranking)
   ├─ Config: data/blocked_domains.txt (easy to edit, one domain per line)
   ├─ Filters video platforms: YouTube, Vimeo, TikTok, Twitch, Rumble, etc.
   ├─ Filters social media: Twitter/X, Facebook, Instagram, LinkedIn
   ├─ Reason: These sites cannot be scraped effectively
   ├─ Debug log: "🚫 Blocked: https://youtube.com/..."
   └─ Summary: "🚫 Filtered 6 non-scrapable URLs (video/social platforms)"
2. URL Ranking (Automatik-LLM)
   ├─ Input: ~22 URLs (after filtering) with titles and snippets
   ├─ Model: Automatik-LLM (num_ctx: 12K)
   ├─ Prompt: url_ranking.txt (EN only - output is numeric)
   ├─ Options:
   │  ├─ temperature: 0.0 (deterministic ranking)
   │  └─ num_predict: 100 (short response)
   ├─ Output: "3,7,1,12,5,8,2" (comma-separated indices)
   └─ Result: Top 7 (deep) or Top 3 (quick) URLs by relevance
3. Why LLM-based Ranking?
   ├─ Semantic understanding of query-URL relevance
   ├─ No maintenance of keyword lists or domain whitelists
   ├─ Adapts to any topic (universal)
   └─ Better than first-come-first-served ordering
4. Skip Conditions:
   ├─ Direct URL mode (user provided URLs directly)
   ├─ Less than top_n URLs found
   └─ No titles/snippets available (fallback to original order)
```
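Applying the ranking output to the candidate list is mostly defensive parsing. A sketch, assuming 1-based indices in the LLM output (an illustrative assumption) and falling back to original order on malformed output, as the skip conditions describe:

```python
# Turn the LLM's "3,7,1,..." ranking string into a reordered URL list.
def apply_ranking(urls: list[str], ranking: str, top_n: int) -> list[str]:
    try:
        order = [int(i) for i in ranking.split(",")]
    except ValueError:
        return urls[:top_n]  # malformed ranking -> fallback to original order
    # Keep only valid indices, in the order the LLM chose
    ranked = [urls[i - 1] for i in order if 1 <= i <= len(urls)]
    return ranked[:top_n]
```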
```
PARALLEL EXECUTION:
├─ ThreadPoolExecutor (max 5 workers)
│  ├─ Scrape Top 3/7 URLs (ranked by relevance)
│  └─ Extract text content + word count
│
└─ Async Task: Main LLM Preload (Ollama only)
   ├─ llm_client.preload_model(model)
   ├─ Runs parallel to scraping
   └─ vLLM/TabbyAPI: Skip (already loaded)

Progress Updates:
└─ Yield after each URL completion
```
Scraping Strategy (trafilatura + Playwright Fallback):
```
1. trafilatura (fast, lightweight)
   ├─ Direct HTTP request, HTML parsing
   └─ Works for most static websites
2. IF trafilatura returns < 800 words:
   ├─ Playwright fallback (headless Chromium)
   ├─ Executes JavaScript, renders dynamic content
   └─ For SPAs: React, Vue, Angular sites
3. IF download failed (404, timeout, bot-protection):
   ├─ NO Playwright fallback (pointless)
   └─ Mark URL as failed with error reason
```
The 800-word threshold is configurable via PLAYWRIGHT_FALLBACK_THRESHOLD in config.py.
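The fallback decision itself is a small piece of logic. In this sketch the two scraper callables are stand-ins for trafilatura and Playwright, and the result-dict shape is illustrative:

```python
PLAYWRIGHT_FALLBACK_THRESHOLD = 800  # words, mirrors the config.py setting

def scrape(url: str, fast_scrape, browser_scrape) -> dict:
    """Two-stage scrape: fast HTTP parse first, headless browser only if thin."""
    text = fast_scrape(url)  # trafilatura: plain HTTP + HTML parsing
    if text is None:
        # Hard failure (404, timeout, bot protection): browser retry is pointless
        return {"url": url, "text": None, "failed": True}
    if len(text.split()) < PLAYWRIGHT_FALLBACK_THRESHOLD:
        # Too little content -> likely a JS-rendered SPA, render it properly
        rendered = browser_scrape(url)
        if rendered and len(rendered.split()) > len(text.split()):
            text = rendered
    return {"url": url, "text": text, "failed": False}
```

Distinguishing "download failed" from "downloaded but thin" is the key design choice: only the latter justifies the expensive headless-browser pass.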
```
1. Build Context
   ├─ Filter successful scrapes (word_count > 0)
   ├─ build_context() - smart token limit aware
   └─ Build system_rag prompt (with context + timestamp)
2. LLM Call - Final Response
   ├─ Model: Main-LLM
   ├─ Temperature: From Pre-Processing or manual
   ├─ Context: ~3 sources, 5K-10K tokens
   ├─ Streaming: Yes
   └─ TTFT + Tokens/s measurement
3. Cache Decision (via Volatility Tag from Main LLM)
   ├─ Main LLM includes <volatility>DAILY/WEEKLY/MONTHLY/PERMANENT</volatility>
   ├─ Volatility determines TTL:
   │  ├─ DAILY (24h): News, current events
   │  ├─ WEEKLY (7d): Semi-current topics
   │  ├─ MONTHLY (30d): Statistics, reports
   │  └─ PERMANENT (∞): Timeless facts ("What is Python?")
   ├─ Semantic Duplicate Check (distance < 0.3 to existing entries)
   │  └─ IF duplicate: Delete old entry (ensures latest data)
   ├─ cache.add(query, answer, sources, metadata, ttl)
   └─ Debug: "💾 Answer cached (TTL: {volatility})"
4. Format & Update History
   └─ Metadata: "(Agent: quick, {n} sources)"
5. History Compression Check (same as Own Knowledge mode)
   └─ Automatic compression at >70% context utilization
```
LLM Calls:
- With Cache: 1-2 + optional 1 Compression
- Without Cache: 3-4 + optional 1 Compression
Async Tasks:
- Parallel URL scraping (3 URLs)
- Background LLM preload (Ollama only)
Code: aifred/lib/research/orchestrator.py + Submodules
Most thorough mode: Scrapes top 7 URLs in parallel for maximum information depth.
Workflow: Identical to Quick Web Search, with the following differences:
URL Scraping (parallel via ThreadPoolExecutor):
```
├─ Quick Mode: 3 URLs scraped → ~3 successful sources
├─ Deep Mode: 7 URLs scraped → ~5-7 successful sources
└─ Automatik: 7 URLs scraped (uses deep mode)
```
Note: "APIs" refers to search APIs (Brave, Tavily, SearXNG); "URLs" refers to actual web pages being scraped.
Parallel Execution:
```
├─ ThreadPoolExecutor (max 5 workers)
│  ├─ Scrape Top 7 URLs simultaneously
│  └─ Continue until 5 successful OR all tried
│
└─ Async: Main LLM Preload (parallel)
```
```
Quick: ~5K-10K tokens context
Deep:  ~10K-20K tokens context
→ More sources = richer context
→ Longer LLM inference (10-40s vs 5-30s)
```
LLM Calls: Identical to Quick (3-4 + optional 1 Compression)
Async Tasks: More parallel URLs (7 vs 3)
Trade-off: Higher quality vs longer duration
History Compression: Like all modes - automatic at >70% context
```
USER INPUT
    │
    ▼
┌─────────────────────┐
│   Research Mode?    │
└─────────────────────┘
    │
    ├── "none" ──────────────────────┐
    │                                │
    ├── "automatik" ──────────┐      │
    │                         │      │
    ├── "quick" ──────────┐   │      │
    │                     │   │      │
    └── "deep" ───────┐   │   │      │
                      │   │   │      │
                      ▼   ▼   ▼      ▼
                ┌───────────────────┐
                │   MODE HANDLER    │
                └───────────────────┘
                          │
       ┌──────────────────┼──────────────────┐
       │                  │                  │
       ▼                  ▼                  ▼
 ┌──────────┐     ┌────────────────┐  ┌───────────────┐
 │ OWN      │     │ AUTOMATIK      │  │ WEB           │
 │ KNOWLEDGE│     │ (AI Decides)   │  │ RESEARCH      │
 └──────────┘     └────────────────┘  │ (quick/deep)  │
       │                  │           └───────────────┘
       │                  ▼
       │          ┌────────────────┐
       │          │ Vector Cache   │
       │          │ Check          │
       │          └────────────────┘
       │                  │
       │     ┌────────────┼────────────┐
       │     │            │            │
       │     ▼            ▼            ▼
       │ ┌────────┐ ┌─────────┐  ┌─────────┐
       │ │ CACHE  │ │ RAG     │  │ CACHE   │
       │ │ HIT    │ │ CONTEXT │  │ MISS    │
       │ │ RETURN │ │ FOUND   │  │         │
       │ └────────┘ └─────────┘  └─────────┘
       │                 │            │
       │                 │            ▼
       │                 │     ┌──────────────┐
       │                 │     │ Keyword      │
       │                 │     │ Override?    │
       │                 │     └──────────────┘
       │                 │        │       │
       │                 │        NO      YES ──▶ Web Research
       │                 │        │
       │                 │        ▼
       │                 │  ┌───────────────────┐
       │                 │  │ LLM Decision      │
       │                 │  │ (Automatik-LLM)   │
       │                 │  │ ❌ History: NO    │
       │                 │  │ ✅ Vision JSON    │
       │                 │  └───────────────────┘
       │                 │        │       │
       │                 │        NO      YES ──▶ Web Research
       │                 │        │
       ▼                 ▼        ▼
┌────────────────────────────────────────────────────┐
│               DIRECT LLM INFERENCE                 │
│ 1. Build Messages (with/without RAG)               │
│ 2. Intent Detection (auto mode, ❌ no history)     │
│ 3. Main LLM Call (streaming, ✅ FULL history)      │
│ 4. Format & Update History                         │
└────────────────────────────────────────────────────┘
                         │
                         ▼
                   ┌──────────┐
                   │ RESPONSE │
                   └──────────┘
```
WEB RESEARCH PIPELINE
─────────────────────
    │
    ▼
┌─────────────────────┐
│   Session Cache?    │
└─────────────────────┘
    │
    ├── HIT ──► CACHE HIT (return cached answer)
    └── MISS
          │
          ▼
┌───────────────────────────┐
│  Query Optimization       │
│  (Automatik-LLM)          │
│  ✅ History: 3 turns      │
│  ✅ Vision JSON           │
└───────────────────────────┘
    │
    ▼
┌───────────────────┐
│  Web Search       │
│  (Multi-API)      │
│  → ~30 URLs       │
└───────────────────┘
    │
    ▼
┌───────────────────┐
│  URL Ranking      │
│  (Automatik-LLM)  │
│  → Top 3/7 URLs   │
└───────────────────┘
    │
    ▼
┌───────────────────┐
│  PARALLEL TASKS   │
├───────────────────┤
│  • Scraping       │
│    (ranked URLs)  │
│  • LLM Preload    │
│    (async)        │
└───────────────────┘
    │
    ▼
┌───────────────────┐
│  Context Build    │
└───────────────────┘
    │
    ▼
┌───────────────────────────┐
│  Main LLM (streaming)     │
│  ✅ History: FULL         │
│  ✅ Vision JSON           │
└───────────────────────────┘
    │
    ▼
┌───────────────────┐
│  Cache Storage    │
│  (TTL from LLM)   │
└───────────────────┘
    │
    ▼
┌───────────────────┐
│     RESPONSE      │
└───────────────────┘
Core Entry Points:
aifred/state.py - Main state management, send_message()
Automatik Mode:
aifred/lib/conversation_handler.py - Decision logic, RAG context
Web Research Pipeline:
- aifred/lib/research/orchestrator.py - Top-level orchestration (incl. URL ranking)
- aifred/lib/research/cache_handler.py - Session cache
- aifred/lib/research/query_processor.py - Query optimization + search
- aifred/lib/research/url_ranker.py - LLM-based URL relevance ranking (NEW)
- aifred/lib/research/scraper_orchestrator.py - Parallel scraping
- aifred/lib/research/context_builder.py - Context building + LLM
Supporting Modules:
- aifred/lib/vector_cache.py - ChromaDB semantic cache
- aifred/lib/rag_context_builder.py - RAG context from cache
- aifred/lib/intent_detector.py - Temperature selection
- aifred/lib/agent_tools.py - Web search, scraping, context building
The Automatik-LLM uses dedicated prompts in prompts/{de,en}/automatik/ for various decisions:
| Prompt | Language | When Called | Purpose |
|---|---|---|---|
| intent_detection.txt | EN only | Pre-processing | Determine query intent (FACTUAL/MIXED/CREATIVE) and addressee |
| research_decision.txt | DE + EN | Phase 4 | Decide if web research needed + generate queries |
| rag_relevance_check.txt | DE + EN | Phase 2 (RAG) | Check if cached entry is relevant to current question |
| followup_intent_detection.txt | DE + EN | Cache follow-up | Detect if user wants more details from cache |
| url_ranking.txt | EN only | Phase 2.5 | Rank URLs by relevance (output: numeric indices) |
Language Rules:
- EN only: Output is structured/numeric (parseable), language doesn't affect result
- DE + EN: Output depends on user's language or requires semantic understanding in that language
Prompt Directory Structure:
prompts/
├── de/
│   └── automatik/
│       ├── research_decision.txt          # German queries for German users
│       ├── rag_relevance_check.txt        # German semantic matching
│       └── followup_intent_detection.txt
└── en/
    └── automatik/
        ├── intent_detection.txt           # Universal intent detection
        ├── research_decision.txt          # English queries (Query 1 always EN)
        ├── rag_relevance_check.txt        # English semantic matching
        ├── followup_intent_detection.txt
        └── url_ranking.txt                # Numeric output (indices)
AIfred provides a complete REST API for programmatic control - enabling remote operation via Cloud, automation systems, and third-party integrations.
The API acts as a Browser Remote Control - it doesn't run LLM inference itself, but injects messages into the browser session which then executes the full pipeline:
API Request ──► set_pending_message() ──► Browser polls (1s)
                                               │
                                               ▼
                                    send_message() pipeline
                                               │
                                               ▼
                       Intent Detection, Multi-Agent, Research...
                                               │
                                               ▼
                          Response visible in Browser + API
Benefits:
- One code path: API uses exactly the same code as manual browser input
- All features work: Multi-Agent, Research Mode, Vision, History Compression
- Live feedback: User sees streaming output in browser while API waits
- Less code: No duplicate LLM logic in API layer
- Full Remote Control: Control all AIfred settings from anywhere
- Live Browser Sync: API changes automatically appear in the browser UI
- Message Injection: Queue messages that browser processes with full pipeline
- Session Management: Access and manage multiple browser sessions
- OpenAPI Documentation: Interactive Swagger UI at /docs
The API enables pure remote control - messages are injected into browser sessions, the browser performs the full processing (Intent Detection, Multi-Agent, Research, etc.). The user sees everything live in the browser.
| Endpoint | Method | Description |
|---|---|---|
| /api/health | GET | Health check with backend status |
| /api/settings | GET | Retrieve all settings |
| /api/settings | PATCH | Update settings (partial update) |
| /api/models | GET | List available models |
| /api/chat/inject | POST | Inject message into browser session |
| /api/chat/status | GET | Check if inference is running (is_generating, message_count) |
| /api/chat/history | GET | Get chat history |
| /api/chat/clear | POST | Clear chat history |
| /api/sessions | GET | List all browser sessions |
| /api/system/restart-ollama | POST | Restart Ollama |
| /api/system/restart-aifred | POST | Restart AIfred |
| /api/calibrate | POST | Start context calibration |
When you change settings or inject messages via API, the browser UI updates automatically:
- Chat Sync: Injected messages trigger full inference in browser within 2 seconds
- Settings Sync: Model changes, Multi-Agent mode, temperature etc. update live in the UI
- Status Polling: Use /api/chat/status to wait for inference completion
This enables true remote control - change AIfred's configuration from another device and see the changes reflected immediately in any connected browser.
# Get current settings
curl http://localhost:8002/api/settings
# Change model and Multi-Agent mode
curl -X PATCH http://localhost:8002/api/settings \
-H "Content-Type: application/json" \
-d '{"aifred_model": "qwen3:14b", "multi_agent_mode": "critical_review"}'
# Inject a message (browser runs full pipeline)
curl -X POST http://localhost:8002/api/chat/inject \
-H "Content-Type: application/json" \
-d '{"message": "What is Python?", "device_id": "abc123..."}'
# Poll until inference is complete
curl "http://localhost:8002/api/chat/status?device_id=abc123..."
# Returns: {"is_generating": false, "message_count": 5}
# Get the response
curl "http://localhost:8002/api/chat/history?device_id=abc123..."
# List all browser sessions (to get device_id)
curl http://localhost:8002/api/sessions

- Cloud Control: Operate AIfred from anywhere via HTTPS/API
- Home Automation: Integration with Home Assistant, Node-RED, etc.
- Voice Assistants: Alexa/Google Home can send AIfred queries
- Batch Processing: Automated queries via scripts
- Mobile Apps: Custom apps can use the API
- Remote Maintenance: Test and monitor AIfred on headless systems
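The inject-then-poll workflow shown in the curl examples above can also be driven from a short Python client. This is a minimal sketch using only the standard library; the base URL assumes a local instance on port 8002, and `device_id` must come from `GET /api/sessions`:

```python
import json
import time
import urllib.parse
import urllib.request

BASE = "http://localhost:8002"  # assumed local AIfred instance

def endpoint(path, **params):
    """Build a full API URL with urlencoded query parameters."""
    query = urllib.parse.urlencode(params)
    return f"{BASE}{path}" + (f"?{query}" if query else "")

def get_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def post_json(url, payload):
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def ask_aifred(message, device_id, timeout=300):
    """Inject a message, poll until inference completes, return history."""
    post_json(endpoint("/api/chat/inject"),
              {"message": message, "device_id": device_id})
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_json(endpoint("/api/chat/status", device_id=device_id))
        if not status.get("is_generating"):
            break
        time.sleep(1)  # browser picks up injected messages within ~2 s
    return get_json(endpoint("/api/chat/history", device_id=device_id))
```

Because the browser runs the full pipeline, `ask_aifred()` benefits from Multi-Agent, Research Mode, and History Compression with no extra client logic.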
- Python 3.10+
- LLM Backend (choose one):
- llama.cpp via llama-swap (GGUF models) - best performance, full GPU control (setup guide)
- Ollama (easy, GGUF models) - recommended for getting started
- vLLM (fast, AWQ models) - best performance for AWQ (requires Compute Capability 7.5+)
- TabbyAPI (ExLlamaV2/V3, EXL2 models) - experimental
Zero-Config Model Management (llama.cpp backend): After the initial setup, adding models requires no manual configuration. Just run ollama pull model or hf download ..., then restart llama-swap → the autoscan configures everything automatically (YAML entries, groups, VRAM cache). See docs/deployment.md for the full setup guide.
- 8GB+ RAM (12GB+ recommended for larger models)
- Docker (for ChromaDB Vector Cache)
- GPU: NVIDIA GPU recommended (see GPU Compatibility Guide)
- Clone repository:
git clone https://github.com/yourusername/AIfred-Intelligence.git
cd AIfred-Intelligence

- Create virtual environment:
python3 -m venv venv
source venv/bin/activate # Linux/Mac
# or
venv\Scripts\activate # Windows

- Install dependencies:
pip install -r requirements.txt
# Install Playwright browser (for JS-heavy pages)
playwright install chromium

Main Dependencies (see requirements.txt):
| Category | Packages |
|---|---|
| Framework | reflex, fastapi, pydantic |
| LLM Backends | httpx, openai, pynvml, psutil |
| Web Research | trafilatura, playwright, requests, pymupdf |
| Vector Cache | chromadb, ollama, numpy |
| Audio (STT/TTS) | edge-tts, XTTS v2 (Docker), openai-whisper |
- Environment variables (.env):
# API Keys for web research
BRAVE_API_KEY=your_key_here
TAVILY_API_KEY=your_key_here
# Ollama configuration
OLLAMA_BASE_URL=http://localhost:11434
# Cloud LLM API Keys (optional - only needed if using cloud backends)
DASHSCOPE_API_KEY=your_key_here # Qwen (DashScope) - https://dashscope.console.aliyun.com/
DEEPSEEK_API_KEY=your_key_here # DeepSeek - https://platform.deepseek.com/
ANTHROPIC_API_KEY=your_key_here # Claude (Anthropic) - https://console.anthropic.com/
MOONSHOT_API_KEY=your_key_here # Kimi (Moonshot) - https://platform.moonshot.cn/

- Install LLM Models:
Option A: All Models (Recommended)
# Master script for both backends
./scripts/download_all_models.sh

Option B: Ollama Only (GGUF) - Easiest Installation
# Ollama Models (GGUF Q4/Q8)
./scripts/download_ollama_models.sh
# Recommended core models:
# - qwen3:30b-instruct (18GB) - Main-LLM, 256K context
# - qwen3:8b (5.2GB) - Automatik, optional thinking
# - qwen2.5:3b (1.9GB) - Ultra-fast Automatik

Option C: vLLM Only (AWQ) - Best Performance
# Install vLLM (if not already done)
pip install vllm
# vLLM Models (AWQ Quantization)
./scripts/download_vllm_models.sh
# Recommended models:
# - Qwen3-8B-AWQ (~5GB, 40K→128K with YaRN)
# - Qwen3-14B-AWQ (~8GB, 32K→128K with YaRN)
# - Qwen2.5-14B-Instruct-AWQ (~8GB, 128K native)
# Start vLLM server with YaRN (64K context)
./venv/bin/vllm serve Qwen/Qwen3-14B-AWQ \
--quantization awq_marlin \
--port 8001 \
--rope-scaling '{"rope_type":"yarn","factor":2.0,"original_max_position_embeddings":32768}' \
--max-model-len 65536 \
--gpu-memory-utilization 0.85
# Systemd service setup: see docs/infrastructure/

Option D: TabbyAPI (EXL2) - Experimental
# Not yet fully implemented
# See: https://github.com/theroyallab/tabbyAPI

- Start ChromaDB Vector Cache (Docker):
Prerequisites: Docker Compose v2 recommended
# Install Docker Compose v2 (if not already installed)
sudo apt-get install docker-compose-plugin
docker compose version # should show v2.x.x

Start ChromaDB:
cd docker
docker compose up -d chromadb
cd ..
# Verify it's healthy
docker ps | grep chroma
# Should show: (healthy) in status
# Test API v2
curl http://localhost:8000/api/v2/heartbeat

Optional: Also start SearXNG (local search engine):
cd docker
docker compose --profile full up -d
cd ..

Reset ChromaDB Cache (if needed):
Option 1: Complete restart (deletes all data)
cd docker
docker compose stop chromadb
cd ..
rm -rf docker/aifred_vector_cache/
cd docker
docker compose up -d chromadb
cd ..

Option 2: Delete collection only (while container is running)
./venv/bin/python -c "
import chromadb
from chromadb.config import Settings
client = chromadb.HttpClient(
host='localhost',
port=8000,
settings=Settings(anonymized_telemetry=False)
)
try:
client.delete_collection('research_cache')
    print('✅ Collection deleted')
except Exception as e:
    print(f'⚠️ Error: {e}')
"

- Start XTTS Voice Cloning (Optional, Docker):
XTTS v2 provides high-quality voice cloning with multilingual support and smart GPU/CPU selection.
cd docker/xtts
docker compose up -d

First start takes ~2-3 minutes (model download ~1.5GB). After that, XTTS is available as TTS engine in the UI settings.
Features:
- 58 built-in voices + custom voice cloning (6-10s reference audio)
- Automatic GPU/CPU selection based on available VRAM
- Manual CPU-Mode Toggle: Save GPU VRAM for larger LLM context window (slower TTS)
- Multilingual support (16 languages) with automatic code-switching (DE/EN mixed)
- Per-agent voices with individual pitch and speed settings
- Multi-Agent TTS Queue: Sequential playback of AIfred → Sokrates → Salomo responses
- Async TTS generation (doesn't block next LLM inference)
- VRAM Management: In GPU mode, ~2 GB VRAM is reserved and deducted from LLM context window
See docker/xtts/README.md for full documentation.
- Start MOSS-TTS Voice Cloning (Optional, Docker):
MOSS-TTS (MossTTSLocal 1.7B) provides state-of-the-art zero-shot voice cloning across 20 languages with excellent speech quality.
cd docker/moss-tts
docker compose up -d

First start takes ~5-10 minutes (model download ~3-5 GB). After that, MOSS-TTS is available as TTS engine in the UI settings.
Features:
- Zero-shot voice cloning (reference audio, no transcription needed)
- 20 languages including German and English
- Excellent speech quality (EN SIM 73.42%, ZH SIM 78.82% - best open-source)
Limitations:
- High VRAM usage: ~11.5 GB in BF16 (vs. 2 GB for XTTS)
- Not suitable for streaming: ~18-22s per sentence (vs. ~1-2s for XTTS)
- VRAM Management: In GPU mode, ~11.5 GB VRAM is reserved and deducted from LLM context window
- Recommended for high-quality offline audio generation, not for real-time streaming
- Start application:
reflex run

The app will run at: http://localhost:3002
AIfred supports different LLM backends that can be switched dynamically in the UI:
- llama.cpp (via llama-swap): GGUF models, best raw performance (+43% generation, +30% prompt processing vs Ollama), full GPU control, multi-GPU support. Uses a 3-tier architecture: llama-swap (Go proxy, model management) → llama-server (inference) → llama.cpp (library). Automatic VRAM calibration via 3-phase Binary Search: GPU-only context sizing → Speed variant with optimized tensor-split for multi-GPU throughput → Hybrid NGL fallback for oversized models. See setup guide.
- Ollama: GGUF models (Q4/Q8), easiest installation, automatic model management, good performance after v2.32.0 optimizations
- vLLM: AWQ models (4-bit), best performance with AWQ Marlin kernel
- TabbyAPI: EXL2 models (ExLlamaV2/V3) - experimental, basic support only
AIfred automatically detects your GPU at startup and warns about incompatible backend configurations:
- Tesla P40 / GTX 10 Series (Pascal): Use llama.cpp or Ollama (GGUF) - vLLM/AWQ not supported
- RTX 20+ Series (Turing/Ampere/Ada): llama.cpp (GGUF) or vLLM (AWQ) recommended for best performance
Detailed information: GPU_COMPATIBILITY.md
Settings are saved in data/settings.json:
Per-Backend Model Storage:
- Each backend remembers its last used models
- When switching backends, the correct models are automatically restored
- On first start, defaults from aifred/lib/config.py are used
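The restore-on-switch behavior can be sketched as a small lookup with a defaults fallback. This is a minimal sketch, not the actual code; `restore_models` is an illustrative name, and `defaults` stands in for the defaults in aifred/lib/config.py:

```python
def restore_models(settings, defaults):
    """Return the stored model selection for the active backend.

    `settings` follows the data/settings.json structure documented in this
    section; falls back to config defaults on first start.
    """
    backend = settings.get("backend_type")
    stored = settings.get("backend_models", {}).get(backend)
    # First start: nothing stored yet -> use config defaults for this backend
    return stored if stored else defaults.get(backend, {})

settings = {
    "backend_type": "vllm",
    "backend_models": {
        "ollama": {"selected_model": "qwen3:8b"},
        "vllm": {"selected_model": "Qwen/Qwen3-8B-AWQ"},
    },
}
models = restore_models(settings, defaults={})
```

Switching `backend_type` back to `"ollama"` would restore `qwen3:8b` without touching the vLLM selection.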
Sampling Parameter Persistence:
| Parameter | Persisted? | On Restart | On Model Change |
|---|---|---|---|
| Temperature | Yes (settings.json) | Kept | Reset to YAML |
| Top-K, Top-P, Min-P, Repeat-Penalty | No | Reset to YAML | Reset to YAML |
Source of truth for sampling defaults: --temp, --top-k, --top-p, --min-p, --repeat-penalty flags in the llama-swap YAML config (~/.config/llama-swap/config.yaml).
Example Settings Structure:
{
"backend_type": "vllm",
"enable_thinking": true,
"backend_models": {
"ollama": {
"selected_model": "qwen3:8b",
"automatik_model": "qwen2.5:3b"
},
"vllm": {
"selected_model": "Qwen/Qwen3-8B-AWQ",
"automatik_model": "Qwen/Qwen3-4B-AWQ"
}
}
}

AIfred supports per-agent reasoning configuration for enhanced answer quality.
Per-Agent Reasoning Toggles (v2.23.0):
Each agent (AIfred, Sokrates, Salomo) has its own reasoning toggle in the LLM settings. These toggles control both mechanisms:
- Reasoning Prompt: Chain-of-Thought instructions in the system prompt (works for ALL models)
- enable_thinking Flag: Technical flag for thinking-capable models (Qwen3, QwQ, NemoTron)
| Toggle | Reasoning Prompt | enable_thinking | Effect |
|---|---|---|---|
| ON | ✅ Injected | ✅ True | Full CoT with <think> blocks (thinking models) |
| ON | ✅ Injected | ✅ True | CoT instructions followed (instruct models, no <think>) |
| OFF | ❌ Not injected | ❌ False | Direct answers, no reasoning |
Design Rationale:
- Instruct models (without native <think> tags) benefit from CoT prompt instructions
- Thinking models receive both: CoT prompt + technical flag for <think> block generation
- This unified approach provides consistent behavior regardless of model type
Additional Features:
- Formatting: Reasoning process displayed as collapsible accordion with model name and inference time
- Temperature: Independent from reasoning - uses Intent Detection (auto) or manual value in sampling table
- Automatik-LLM: Reasoning always DISABLED for Automatik decisions (8x faster)
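The two mechanisms controlled by each toggle can be sketched as follows. This is illustrative only: `build_agent_request` and `COT_INSTRUCTIONS` are made-up names, not AIfred's actual prompt text or functions:

```python
# Illustrative stand-in for the real CoT instructions in the system prompt
COT_INSTRUCTIONS = "Think step by step before answering."

def build_agent_request(system_prompt, reasoning_on):
    """Sketch of how one agent's reasoning toggle drives both mechanisms."""
    if reasoning_on:
        # Mechanism 1: CoT instructions injected into the system prompt
        # (effective for ALL models, including plain instruct models)
        system_prompt = f"{system_prompt}\n\n{COT_INSTRUCTIONS}"
    return {
        "system": system_prompt,
        # Mechanism 2: enable_thinking flag, only meaningful for
        # thinking-capable models (Qwen3, QwQ, NemoTron)
        "enable_thinking": reasoning_on,
    }

on = build_agent_request("You are Sokrates.", True)
off = build_agent_request("You are Sokrates.", False)
```

With the toggle ON, both the prompt injection and the flag are active; with it OFF, the agent answers directly, matching the table above.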
AIfred-Intelligence/
├── aifred/
│   ├── backends/                   # LLM Backend Adapters
│   │   ├── base.py                 # Abstract Base Class
│   │   ├── llamacpp.py             # llama.cpp Backend (GGUF via llama-swap)
│   │   ├── ollama.py               # Ollama Backend (GGUF)
│   │   ├── vllm.py                 # vLLM Backend (AWQ)
│   │   └── tabbyapi.py             # TabbyAPI Backend (EXL2)
│   ├── lib/                        # Core Libraries
│   │   ├── multi_agent.py          # Multi-Agent System (AIfred, Sokrates, Salomo)
│   │   ├── context_manager.py      # History compression
│   │   ├── conversation_handler.py # Automatik mode, RAG context
│   │   ├── config.py               # Default settings
│   │   ├── vector_cache.py         # ChromaDB Vector Cache
│   │   ├── model_vram_cache.py     # Unified VRAM cache (all backends)
│   │   ├── llamacpp_calibration.py # llama.cpp Binary Search calibration
│   │   ├── gguf_utils.py           # GGUF metadata reader (native context, quant)
│   │   ├── research/               # Web research modules
│   │   │   ├── orchestrator.py     # Research orchestration
│   │   │   ├── url_ranker.py       # LLM-based URL ranking
│   │   │   └── query_processor.py  # Query processing
│   │   └── tools/                  # Tool implementations
│   │       ├── search_tools.py     # Parallel web search
│   │       └── scraper_tool.py     # Parallel web scraping
│   ├── aifred.py                   # Main application / UI
│   └── state.py                    # Reflex State Management
├── prompts/                        # System Prompts (de/en)
├── scripts/                        # Utility Scripts
├── docs/                           # Documentation
│   ├── infrastructure/             # Service setup guides
│   ├── architecture/               # Architecture docs
│   └── GPU_COMPATIBILITY.md        # GPU compatibility matrix
├── data/                           # Runtime data (settings, sessions, caches)
│   ├── settings.json               # User settings
│   ├── model_vram_cache.json       # VRAM calibration data (all backends)
│   ├── sessions/                   # Chat sessions
│   └── logs/                       # Debug logs
├── docker/                         # Docker configurations
│   └── aifred_vector_cache/        # ChromaDB Docker setup
└── CHANGELOG.md                    # Project Changelog
At 70% context utilization, older conversations are automatically compressed using PRE-MESSAGE checks (v2.12.0):
| Parameter | Value | Description |
|---|---|---|
| HISTORY_COMPRESSION_TRIGGER | 0.7 (70%) | Compression triggers at this context utilization |
| HISTORY_COMPRESSION_TARGET | 0.3 (30%) | Target after compression (room for ~2 roundtrips) |
| HISTORY_SUMMARY_RATIO | 0.25 (4:1) | Summary = 25% of content being compressed |
| HISTORY_SUMMARY_MIN_TOKENS | 500 | Minimum for meaningful summaries |
| HISTORY_SUMMARY_TOLERANCE | 0.5 (50%) | Allowed overshoot, above this gets truncated |
| HISTORY_SUMMARY_MAX_RATIO | 0.2 (20%) | Max context percentage for summaries (NEW) |
Algorithm (PRE-MESSAGE):
- PRE-CHECK before each LLM call (not after!)
- Trigger at 70% context utilization
- Dynamic max_summaries based on context size (20% budget / 500 tok)
- FIFO cleanup: If too many summaries, oldest is deleted first
- Collect oldest messages until remaining < 30%
- Compress collected messages to summary (4:1 ratio)
- New History = [Summaries] + [remaining messages]
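The trigger and budget arithmetic in the steps above can be expressed as pure functions. A minimal sketch mirroring the constants from aifred/lib/config.py; the real logic lives in aifred/lib/context_manager.py and these helper names are illustrative:

```python
# Constants mirror the parameter table above (aifred/lib/config.py)
HISTORY_COMPRESSION_TRIGGER = 0.7
HISTORY_SUMMARY_RATIO = 0.25
HISTORY_SUMMARY_MIN_TOKENS = 500
HISTORY_SUMMARY_MAX_RATIO = 0.2

def should_compress(history_tokens, context_size):
    """PRE-CHECK: run before each LLM call, not after."""
    return history_tokens > HISTORY_COMPRESSION_TRIGGER * context_size

def max_summaries(context_size, cap=10):
    """Dynamic summary limit: 20% context budget / 500 tokens each."""
    return min(int(context_size * HISTORY_SUMMARY_MAX_RATIO
                   / HISTORY_SUMMARY_MIN_TOKENS), cap)

def summary_budget(compressed_tokens):
    """4:1 compression: summary is ~25% of the compressed span."""
    return max(int(compressed_tokens * HISTORY_SUMMARY_RATIO),
               HISTORY_SUMMARY_MIN_TOKENS)
```

For a 7K context this reproduces the example row below: compressing ~2,800 tokens yields a ~700-token summary.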
Dynamic Summary Limits:
| Context | Max Summaries | Calculation |
|---|---|---|
| 4K | 1-2 | 4096 × 0.2 / 500 = 1.6 |
| 8K | 3 | 8192 × 0.2 / 500 = 3.3 |
| 32K | 10 | 32768 × 0.2 / 500 = 13 → capped at 10 |
Token Estimation: Ignores <details>, <span>, <think> tags (not sent to LLM)
Examples by Context Size:
| Context | Trigger | Target | Compressed | Summary |
|---|---|---|---|---|
| 7K | 4,900 tok | 2,100 tok | ~2,800 tok | ~700 tok |
| 40K | 28,000 tok | 12,000 tok | ~16,000 tok | ~4,000 tok |
| 200K | 140,000 tok | 60,000 tok | ~80,000 tok | ~20,000 tok |
Inline Summaries (UI, v2.14.2+):
- Summaries appear inline where compression occurred
- Each summary as collapsible with header (number, message count)
- FIFO applies only to llm_history (LLM sees 1 summary)
- chat_history keeps ALL summaries (user sees full history)
AIfred uses a multi-tier cache system based on semantic similarity (Cosine Distance) with pure semantic deduplication and intelligent cache usage for explicit research keywords.
Phase 0: Explicit Research Keywords
User Query: "research Python" / "google Python" / "search internet Python"
ββ Explicit keyword detected β Cache check FIRST
ββ Distance < 0.05 (practically identical)
β ββ β
Cache Hit (0.15s instead of 100s) - Shows age transparently
ββ Distance β₯ 0.05 (not identical)
ββ New web research (user wants fresh data)
Phase 1a: Direct Cache Hit Check
User Query → ChromaDB Similarity Search
├─ Distance < 0.5 (HIGH Confidence)
│   └─ ✅ Use Cached Answer (instant, no more time checks!)
├─ Distance 0.5-1.2 (MEDIUM Confidence) → Continue to Phase 1b (RAG)
└─ Distance > 1.2 (LOW Confidence) → Continue to Phase 2 (Research Decision)
Phase 1b: RAG Context Check
Cache Miss (d ≥ 0.5) → Query for RAG Candidates (0.5 ≤ d < 1.2)
├─ Found RAG Candidates?
│   ├─ YES → Automatik-LLM checks relevance for each candidate
│   │   ├─ Relevant (semantic match) → Inject as System Message Context
│   │   │   Example: "Python" → "FastAPI" ✅ (FastAPI is Python framework)
│   │   └─ Not Relevant → Skip
│   │       Example: "Python" → "Weather" ❌ (no connection)
│   └─ NO → Continue to Phase 2
└─ LLM Answer with RAG Context (Source: "Cache+LLM (RAG)")
Phase 2: Research Decision
No Direct Cache Hit & No RAG Context
└─ Automatik-LLM decides: Web Research needed?
   ├─ YES → Web Research + Cache Result
   └─ NO → Pure LLM Answer (Source: "LLM-Training Data")
When Saving to Vector Cache:
New Research Result → Check for Semantic Duplicates
└─ Distance < 0.3 (semantically similar)
   └─ ✅ ALWAYS Update
      - Delete old entry
      - Save new entry
      - Guaranteed: Latest data is used

Purely semantic deduplication without time checks → Consistent behavior.
| Distance | Confidence | Behavior | Example |
|---|---|---|---|
| 0.0 - 0.05 | EXACT | Explicit research uses cache | Identical query |
| 0.05 - 0.5 | HIGH | Direct cache hit | "Python tutorial" vs "Python guide" |
| 0.5 - 1.2 | MEDIUM | RAG candidate (relevance check via LLM) | "Python" vs "FastAPI" |
| 1.2+ | LOW | Cache miss → Research decision | "Python" vs "Weather" |
Maintenance tool for Vector Cache:
# Show stats
python3 chroma_maintenance.py --stats
# Find duplicates
python3 chroma_maintenance.py --find-duplicates
# Remove duplicates (Dry-Run)
python3 chroma_maintenance.py --remove-duplicates
# Remove duplicates (Execute)
python3 chroma_maintenance.py --remove-duplicates --execute
# Delete old entries (> 30 days)
python3 chroma_maintenance.py --remove-old 30 --execute

How it works:
- Query finds related cache entries (distance 0.5-1.2)
- Automatik-LLM checks if cached content is relevant to current question
- Relevant entries are injected as system message: "Previous research shows..."
- Main LLM combines cached context + training knowledge for enhanced answer
Example Flow:
User: "What is Python?" β Web Research β Cache Entry 1 (d=0.0)
User: "What is FastAPI?" β RAG finds Entry 1 (d=0.7)
β LLM checks: "Python" relevant for "FastAPI"? YES (FastAPI uses Python)
β Inject Entry 1 as context β Enhanced LLM answer
β Source: "Cache+LLM (RAG)"
Benefits:
- Leverages related past research without exact cache hits
- Avoids false context (LLM filters irrelevant entries)
- Multi-level context awareness (cache + conversation history)
The Main LLM determines cache lifetime via <volatility> tag in response:
| Volatility | TTL | Use Case |
|---|---|---|
| DAILY | 24h | News, current events, "latest developments" |
| WEEKLY | 7 days | Political updates, semi-current topics |
| MONTHLY | 30 days | Statistics, reports, less volatile data |
| PERMANENT | ∞ | Timeless facts ("What is Python?") |
Automatic Cleanup: Background task runs every 12 hours, deletes expired entries.
Cache behavior in aifred/lib/config.py:
# Cache Distance Thresholds
CACHE_DISTANCE_HIGH = 0.5 # < 0.5 = HIGH confidence cache hit
CACHE_DISTANCE_DUPLICATE = 0.3 # < 0.3 = semantic duplicate (always merged)
CACHE_DISTANCE_RAG = 1.2 # < 1.2 = similar enough for RAG context
# TTL (Time-To-Live)
TTL_HOURS = {
'DAILY': 24,
'WEEKLY': 168,
'MONTHLY': 720,
'PERMANENT': None
}

RAG Relevance Check: Uses Automatik-LLM with dedicated prompt (prompts/de/rag_relevance_check.txt)
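The TTL table translates into a simple expiry check. A minimal sketch under the TTL_HOURS values documented above; `is_expired` is an illustrative name, and the real cleanup runs as a background task every 12 hours:

```python
from datetime import datetime, timedelta

# TTL table from the Volatility section (PERMANENT never expires)
TTL_HOURS = {'DAILY': 24, 'WEEKLY': 168, 'MONTHLY': 720, 'PERMANENT': None}

def is_expired(created_at, volatility, now=None):
    """Check whether a cache entry has outlived its volatility-based TTL."""
    ttl = TTL_HOURS.get(volatility)
    if ttl is None:          # PERMANENT entries are never cleaned up
        return False
    now = now or datetime.now()
    return now - created_at > timedelta(hours=ttl)
```

A news entry (`DAILY`) created two days ago is expired, while a `PERMANENT` fact like "What is Python?" never is.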
All important parameters in aifred/lib/config.py:
# History Compression (dynamic, percentage-based)
HISTORY_COMPRESSION_TRIGGER = 0.7 # 70% - When to compress?
HISTORY_COMPRESSION_TARGET = 0.3 # 30% - Where to compress to?
HISTORY_SUMMARY_RATIO = 0.25 # 25% = 4:1 compression
HISTORY_SUMMARY_MIN_TOKENS = 500 # Minimum for summaries
HISTORY_SUMMARY_TOLERANCE = 0.5 # 50% overshoot allowed
# Intent-based Temperature
INTENT_TEMPERATURE_FAKTISCH = 0.2 # Factual queries
INTENT_TEMPERATURE_GEMISCHT = 0.5 # Mixed queries
INTENT_TEMPERATURE_KREATIV = 1.0 # Creative queries
# Backend-specific Default Models (in BACKEND_DEFAULT_MODELS)
# Ollama: qwen3:4b-instruct-2507-q4_K_M (Automatik), qwen3-vl:8b (Vision)
# vLLM: cpatonn/Qwen3-4B-Instruct-2507-AWQ-4bit, etc.

In aifred/backends/ollama.py:
- HTTP Client Timeout: 300 seconds (5 minutes)
- Increased from 60s for large research requests with 30KB+ context
- Prevents timeout errors during first token generation
The AIfred restart button restarts the systemd service:
- Executes systemctl restart aifred-intelligence
- Browser reloads automatically after short delay
- Debug logs cleared, sessions preserved
For production operation as a service, pre-configured service files are available in the systemd/ directory.
AIFRED_ENV=prod MUST be set for AIfred to run on the MiniPC and not redirect to the development machine!
# 1. Copy service files
sudo cp systemd/aifred-chromadb.service /etc/systemd/system/
sudo cp systemd/aifred-intelligence.service /etc/systemd/system/
# 2. Enable and start services
sudo systemctl daemon-reload
sudo systemctl enable aifred-chromadb.service aifred-intelligence.service
sudo systemctl start aifred-chromadb.service aifred-intelligence.service
# 3. Check status
systemctl status aifred-chromadb.service
systemctl status aifred-intelligence.service
# 4. Create first user (required for login)
./aifred-admin add yourusername
# Then register in the web UI with username + password

See systemd/README.md for details, troubleshooting, and monitoring.
AIfred requires user authentication. Manage users via the admin CLI:
./aifred-admin users # List whitelist (who can register)
./aifred-admin add <username> # Add user to whitelist
./aifred-admin remove <username> # Remove from whitelist
./aifred-admin accounts # List registered accounts
./aifred-admin delete <username> # Delete account (with confirmation)
./aifred-admin delete <username> --sessions # Also delete user's sessions

Workflow:
- Admin adds username to whitelist: ./aifred-admin add alice
- User registers in web UI with username + password
- User can now login from any device with their credentials
1. ChromaDB Service (systemd/aifred-chromadb.service):
[Unit]
Description=AIfred ChromaDB Vector Cache (Docker)
After=docker.service
Requires=docker.service
[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/home/mp/Projekte/AIfred-Intelligence/docker
ExecStart=/usr/bin/docker compose up -d chromadb
ExecStop=/usr/bin/docker compose stop chromadb

2. AIfred Intelligence Service (systemd/aifred-intelligence.service):
[Unit]
Description=AIfred Intelligence Voice Assistant (Reflex Version)
After=network.target ollama.service aifred-chromadb.service
Wants=ollama.service
Requires=aifred-chromadb.service
[Service]
Type=simple
User=__USER__
Group=__USER__
WorkingDirectory=__PROJECT_DIR__
Environment="PATH=__PROJECT_DIR__/venv/bin:/usr/local/bin:/usr/bin:/bin"
Environment="PYTHONUNBUFFERED=1"
ExecStart=__PROJECT_DIR__/venv/bin/python -m reflex run --env prod --frontend-port 3002 --backend-port 8002 --backend-host 0.0.0.0
Restart=always
KillMode=control-group
RestartSec=10
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target

Replace __USER__ and __PROJECT_DIR__ with your actual values!
For production/external access, create a .env file in the project root (this file is gitignored and NOT pushed to the repository):
# Environment Mode (required for production)
AIFRED_ENV=prod
# Backend API URL for external access via nginx reverse proxy
# Set this to your external domain/IP for HTTPS access
AIFRED_API_URL=https://your-domain.com:8443
# API Keys for web search (optional)
BRAVE_API_KEY=your_brave_api_key
TAVILY_API_KEY=your_tavily_api_key
# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
# IMPORTANT: Set OLLAMA_NUM_PARALLEL=1 in Ollama service config (see Performance section below)
# Backend URL for static files (HTML Preview, Images)
# With NGINX: Leave empty or omit - NGINX routes /_upload/ to backend
# Without NGINX (dev): Set to backend URL for direct access
# BACKEND_URL=http://localhost:8002
# Cloud LLM API Keys (optional - only needed if using cloud backends)
DASHSCOPE_API_KEY=your_key_here # Qwen (DashScope)
DEEPSEEK_API_KEY=your_key_here # DeepSeek
ANTHROPIC_API_KEY=your_key_here # Claude (Anthropic)
MOONSHOT_API_KEY=your_key_here # Kimi (Moonshot)

Why is AIFRED_API_URL needed?
The Reflex frontend needs to know where the backend is located. Without this setting:
- The frontend auto-detects the local IP (e.g., http://192.168.0.252:8002)
- This works for local network access but fails for external HTTPS access
- External users would see WebSocket connection errors to localhost
With AIFRED_API_URL=https://your-domain.com:8443:
- All API/WebSocket connections go through your nginx reverse proxy
- HTTPS works correctly for external access
- Local HTTP access continues to work
Why --env prod?
The --env prod flag in ExecStart:
- Disables Vite Hot Module Replacement (HMR) WebSocket
- Prevents "failed to connect to websocket localhost:3002" errors
- Reduces resource usage (no dev server overhead)
- Still recompiles on restart when code changes
FOUC Issue in Production Mode
In production mode (--env prod), a FOUC (Flash of Unstyled Content) may occur - a brief flash of unstyled text/CSS class names during page reload.
Cause: React Router 7 with prerender: true loads CSS asynchronously (lazy loading). The generated HTML is visible immediately, but Emotion CSS-in-JS is loaded afterwards.
Solution: Use Dev Mode
If FOUC is bothersome, use dev mode instead:
# Set in .env:
AIFRED_ENV=dev
# Or remove --env prod from the systemd service

Dev Mode Characteristics:
- No FOUC (CSS loaded synchronously)
- Slightly higher RAM usage (hot reload server)
- More console warnings (React Strict Mode)
- Non-minified bundles (slightly larger)
For a local home network server, these drawbacks are negligible.
Additionally required for Dev Mode with external access:
⚠️ IMPORTANT: The .web/vite.config.js file gets overwritten on Reflex updates! Use the patch script after updates: ./scripts/patch-vite-config.sh
In .web/vite.config.js, the following must be configured:
- allowedHosts - for external domain access:
server: {
allowedHosts: ["your-domain.com", "localhost", "127.0.0.1"],
}

- proxy - for API and TTS SSE streaming (required when accessing via frontend port 3002):
```js
server: {
  proxy: {
    '/_upload': { target: 'http://0.0.0.0:8002', changeOrigin: true },
    '/api': { target: 'http://0.0.0.0:8002', changeOrigin: true },
  },
}
```

Without the `/api` proxy, TTS streaming will fail with "text/html instead of text/event-stream" errors.
Optional: Polkit Rule for Restart Without sudo
For the restart button in the web UI without password prompt:
`/etc/polkit-1/rules.d/50-aifred-restart.rules`:

```js
polkit.addRule(function(action, subject) {
    if ((action.id == "org.freedesktop.systemd1.manage-units") &&
        (action.lookup("unit") == "aifred-intelligence.service" ||
         action.lookup("unit") == "ollama.service") &&
        (action.lookup("verb") == "restart") &&
        (subject.user == "mp")) {
        return polkit.Result.YES;
    }
});
```

AIfred is designed as a single-user system but supports 2-3 concurrent users with certain limitations.
Session Isolation (Reflex Framework):
- Each browser tab gets its own session with a unique `client_token` (UUID)
- Chat history is isolated - users don't see each other's conversations
- Streaming responses work in parallel - each user gets their own real-time updates
- Request queuing - Ollama automatically queues concurrent requests internally
Per-User Isolated State:
- ✅ Chat history (`chat_history`, `llm_history`)
- ✅ Current messages and streaming responses
- ✅ Image uploads and crop state
- ✅ Session ID and device ID (cookie-based)
- ✅ Failed sources and debug messages
Backend Configuration (shared across all users):
- ⚠️ Selected backend (Ollama, vLLM, TabbyAPI, Cloud API)
- ⚠️ Backend URL
- ⚠️ Selected models (AIfred-LLM, Automatik-LLM, Sokrates-LLM, Salomo-LLM, Vision-LLM)
- ⚠️ Available models list
- ⚠️ GPU info and VRAM cache
- ⚠️ vLLM process manager
Settings File (data/settings.json):
- ⚠️ All settings are global (temperature, Multi-Agent mode, RoPE factors, etc.)
- ⚠️ If User A changes a setting → User B sees the change immediately
- ⚠️ No per-user settings profiles
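The sharing mechanism can be illustrated with a throwaway file (a temp file stands in for `data/settings.json`; the key name is purely illustrative):

```shell
# Illustration only: two "users" sharing one settings file
f=$(mktemp)
echo '{"temperature": 0.7}' > "$f"      # User A saves a setting
grep -o '"temperature": 0.7' "$f"       # User B's next poll reads the same value
rm -f "$f"
```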
✅ SAFE: Multiple users sending requests
Timeline (Ollama automatically queues requests):
```
─────────────────────────────────────────────────────
User A: Sends question → Ollama processes → Response to User A
User B:   Sends question → Waits in queue → Ollama processes → Response to User B
User C:     Sends question → Waits in queue → Ollama processes → Response to User C
```
- Each user gets their own correct answer
- Ollama's internal queue handles concurrent requests sequentially
- No race conditions as long as nobody changes settings during requests
⚠️ RISKY: Changing settings while requests are running

```
User A: Sends request with Qwen3:8b → Processing...
User B: Switches model to Llama3:70b → Global state changes!
User A: Request continues with Qwen3 parameters (OK - already passed)
User A: Next request would use Llama3 (unintended)
```
- Settings changes affect all users immediately
- Running requests are safe (parameters already passed to backend)
- New requests from User A would use User B's settings
Session Storage:
- Sessions stored in RAM (plain dict by default, no Redis)
- No automatic expiration - sessions stay in memory until server restart
- Empty sessions are small (~1-5 KB each)
- Not a problem: Even 100 empty sessions = ~500 KB RAM
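The RAM estimate above is simple arithmetic:

```shell
# Back-of-envelope check: 100 idle sessions at the ~5 KB upper bound
sessions=100
kb_each=5
echo "$(( sessions * kb_each )) KB"   # prints "500 KB"
```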
Chat History:
- Users who regularly clear their chat history keep memory usage low
- Full conversations (50+ messages) use more RAM but are manageable
- History compression (70% trigger) keeps context manageable
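The 70% trigger can be sketched in a few lines (the context limit and usage numbers here are illustrative, not taken from the code):

```shell
# Compression fires once context usage crosses 70% of the window
ctx_limit=32768
used=24000
threshold=$(( ctx_limit * 70 / 100 ))   # 22937 tokens
if [ "$used" -ge "$threshold" ]; then
  echo "compress history"
else
  echo "ok"
fi
```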
Why is backend configuration global?
AIfred is designed for local hardware with limited resources:
- Single GPU: Can only run one model at a time efficiently
- VRAM constraints: Loading different models per user would exceed VRAM
- Hardware is single-user oriented: All users must share the configured backend/models
This is intentional - the system is optimized for:
- Primary use case: 1 user, occasionally 2-3 users
- Shared hardware: Everyone uses the same GPU/models
- Root control: Administrator (you) manages settings, others use the system as configured
- Establish usage rules:
  - Designate one admin (root user) who manages settings
  - Other users should not change backend/model settings
  - Communicate when changing critical settings
- Safe concurrent usage:
  - ✅ Multiple users can send requests simultaneously
  - ✅ Each user gets their own response and chat history
  - ⚠️ Avoid changing settings while others are actively using the system
- Expected behavior:
  - Users see the same available models (shared dropdown)
  - Settings changes sync across browser tabs within 1-2 seconds (via `settings.json` polling)
  - UI Sync Delay: Model dropdown may not visually update until clicked/reopened (known Reflex limitation)
  - Multi-Agent mode and other simple settings sync immediately and visibly
  - This is by design for single-GPU hardware
- ❌ Not a multi-tenant SaaS: No per-user accounts, quotas, or isolated resources
- ❌ Not designed for >5 concurrent users: Request queue would become slow
- ❌ Not for untrusted users: Any user can change global settings (no permissions/roles)
- ✅ Personal AI assistant for home/office use
- ✅ Family-friendly: 2-3 family members can use it simultaneously without issues
- ✅ Developer-focused: Root user has full control, others use it as configured
- ✅ Hardware-optimized: Makes best use of single GPU for all users
Summary: AIfred works well for small groups (2-3 users) who coordinate settings changes, but is not suitable for large-scale multi-user deployments or untrusted user access.
```bash
# Follow the debug log
tail -f data/logs/aifred_debug.log

# Syntax check
python3 -m py_compile aifred/FILE.py

# Linting with Ruff
source venv/bin/activate && ruff check aifred/

# Type checking with mypy
source venv/bin/activate && mypy aifred/ --ignore-missing-imports
```

Problem: Ollama's default `OLLAMA_NUM_PARALLEL=2` doubles the KV-cache allocation for an unused second parallel slot. This wastes ~50% of your GPU VRAM.
Impact:
- With PARALLEL=2: 30B model fits ~111K context (with CPU offload)
- With PARALLEL=1: 30B model fits ~222K context (pure GPU, no offload)
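The halving follows from how Ollama divides the KV cache among parallel slots — each slot gets an equal share of the available context:

```shell
# Context per slot halves as slot count doubles (figures from the numbers above)
ctx_parallel1=222   # K tokens that fit with OLLAMA_NUM_PARALLEL=1
slots=2
echo "$(( ctx_parallel1 / slots ))K per slot"   # prints "111K per slot"
```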
Solution: Set `OLLAMA_NUM_PARALLEL=1` in Ollama's systemd configuration:

```bash
# Create override directory
sudo mkdir -p /etc/systemd/system/ollama.service.d/

# Create override file
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_NUM_PARALLEL=1"
EOF

# Apply changes
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

When to use PARALLEL=1:
- Single-user setups (home server, personal workstation)
- Maximum context window needed for research/RAG tasks
When to keep PARALLEL=2+:
- Multi-user server with concurrent requests
- Load balancing scenarios
After changing this setting, recalibrate your models in the UI to take advantage of the freed VRAM.
Benchmarks with Qwen3-30B-A3B Q8_0 on 2Γ Tesla P40 (48 GB VRAM total):
| Metric | llama.cpp | Ollama | Advantage |
|---|---|---|---|
| TTFT (Time to First Token) | 1.1s | 1.5s | llama.cpp -27% |
| Generation Speed | 39.3 tok/s | 27.4 tok/s | llama.cpp +43% |
| Prompt Processing | 1,116 tok/s | 862 tok/s | llama.cpp +30% |
| Intent Detection | 0.8s | 0.7s | similar |
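As a sanity check, the percentage columns follow directly from the raw numbers; for example, the generation-speed advantage:

```shell
# llama.cpp vs Ollama generation speed: (39.3 / 27.4 - 1) * 100 ≈ 43%
awk 'BEGIN { printf "+%.0f%%\n", (39.3 / 27.4 - 1) * 100 }'   # prints "+43%"
```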
When to choose llama.cpp:
- Maximum generation speed and throughput
- Multi-GPU setups (full tensor split control)
- Large context windows (direct VRAM calibration)
- Production deployments where every tok/s counts
When to choose Ollama:
- Quick setup and experimentation
- Automatic model management (`ollama pull`)
- Simpler configuration for beginners
Problem: `httpx.ReadTimeout` after 60 seconds on large research requests
Solution: Timeout is already increased to 300s in `aifred/backends/ollama.py`
If problems persist: Restart the Ollama service with `systemctl restart ollama`
Problem: AIfred service doesn't start or stops immediately
Solution:
```bash
# Check logs
journalctl -u aifred-intelligence -n 50

# Check Ollama status
systemctl status ollama
```

Problem: Restart button in web UI has no effect
Solution: Check the Polkit rule in `/etc/polkit-1/rules.d/50-aifred-restart.rules`
More documentation in the docs/ directory:
- Architecture Overview
- API Documentation
- Migration Guide
- llama.cpp + llama-swap Setup Guide
- Tensor Split Benchmark: Speed vs. Full Context
Pull requests are welcome! For major changes, please open an issue first.
MIT License - see LICENSE file