Clawz, Claws, HybridClaw: What a Lightweight Runtime Architecture for AI Agents Actually Looks Like

Design principles, performance, security, scaling and resilience — and why most agent platforms quietly fall apart on exactly those dimensions.

A new family of AI agent runtimes has emerged over the last few months, often grouped colloquially as “Claws” or sometimes “Clawz”: OpenClaw (formerly MoltBot, the personal experimentation framework), PaperClip (the agentic framework from Sebastian Küpers and the Serviceplan Group), Hermes Agent, and our own HybridClaw. What connects them is an architectural break with the first wave of agent platforms: away from monolithic cluster setups, toward lean, locally executable runtimes that boot in seconds and can still be operated in production.

This article is about the architecture and design principles that make HybridClaw a lightweight runtime by deliberate choice — and why exactly that choice is what allows performance, security, scaling and resilience to be addressed cleanly in the first place.

Why “Lightweight” is the Right Model for Agents

The first generation of agent platforms borrowed its blueprint from classical microservice stacks: Kubernetes, service mesh, a dedicated cluster for the vector DB, another for the queue, a third for observability. That blueprint works well for hyperscalers and high-traffic consumer APIs. For agents, it’s the wrong model.

An agent is not a stateless request handler. An agent is closer to a long-running process that composes skills, invokes tools, queries models, drives browsers, and builds up a trajectory along the way — a trajectory that will later be needed for evals, replay and audit. If you press this kind of object into a classical microservice stack, you optimise the wrong axis: you get high availability for components that rarely fail, and at the same time no good answers to the questions that actually hurt in agent operations — “Which skill update broke performance last week?”, “Which tool call set the wrong VAT code?”, “Can we reproduce the run from Tuesday at 11:43?”

HybridClaw takes the opposite path. The runtime is a single Go-style binary that boots on a laptop in under a second. No cluster, no Kubernetes, no Helm chart as a prerequisite. Production deployments scale out to multiple worker processes, but that’s an option, not an entry fee. People who want to start with agents don’t need a DevOps team. People who want to run them in production have the same code path — just with more workers and a control plane on top.

This architectural decision isn’t an aesthetic one. It’s the foundation for six concrete design principles that shape every other property of the runtime.

The Six Design Principles in Detail

1. Lightweight by Default

Single binary, no cluster, no mandatory state store. The runtime itself holds no shared mutable state — coordination happens through the HybridAI control plane (queue + leader election) when a multi-node setup is run. Until then, SQLite or the local filesystem is enough.

2. Local-First Execution

Agents run where the data lives. Skills, tool calls and browser automation execute in the user’s environment by default. No cleartext data leaves the perimeter without explicit routing through the control plane. This isn’t only a GDPR feature — it’s also a latency argument. Anyone trying to route browser automation from a corporate data centre, through public cloud, onto a local ERP login screen has already lost.

3. Deterministic Skills

Skills are versioned, content-addressed manifests. Same input + same skill version = same trajectory. This is the point where many agent frameworks quietly cheat: they stitch together the system prompt at runtime from three templates, pull values out of environment variables, and hope it stays reproducible. It doesn’t. If skills aren’t deterministically addressable, evals are not meaningful and rollbacks are not safe. HybridClaw treats skills like code artefacts: signed, versioned, hash-identifiable.
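To make "content-addressed" concrete, here is a minimal sketch of how a manifest digest could be derived. The manifest fields and the `manifest_digest` helper are illustrative assumptions, not HybridClaw's actual schema or API; the only real technique shown is hashing a canonical serialisation.

```python
import hashlib
import json

def manifest_digest(manifest: dict) -> str:
    """Return a SHA-256 content address for a skill manifest.

    Canonical JSON (sorted keys, fixed separators) keeps the digest
    independent of key order and whitespace.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical manifests; the field names are assumptions for illustration.
skill_v1 = {
    "name": "classify-invoice",
    "version": "1.0.0",
    "prompt": "Classify the invoice.",
    "tools": ["pdf.read"],
}
skill_v1_reordered = {
    "tools": ["pdf.read"],
    "prompt": "Classify the invoice.",
    "version": "1.0.0",
    "name": "classify-invoice",
}
skill_v2 = {**skill_v1, "version": "1.0.1", "prompt": "Classify the invoice by VAT code."}

print(manifest_digest(skill_v1) == manifest_digest(skill_v1_reordered))  # True
print(manifest_digest(skill_v1) == manifest_digest(skill_v2))            # False
```

Because the prompt is part of the hashed content, a prompt assembled at runtime from templates and environment variables would have no stable address, which is exactly the failure described above.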

4. Sandboxed Tool Use

Every tool runs in an isolated execution context with explicit capability grants. Browser automation, file access, shell commands — all gated by signed manifests and runtime policy. Default-deny. A skill that needs to send email doesn’t get shell access. A skill that analyses logs doesn’t reach the mail API. This sounds obvious, but in practice it’s the point where most agent demos fall apart in a real compliance review.
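A default-deny capability check is small enough to sketch in a few lines. The `SkillPolicy` class and the capability names below are hypothetical and only illustrate the principle; real isolation also needs an OS-level or container-level boundary, which this sketch does not provide.

```python
from dataclasses import dataclass

class CapabilityDenied(PermissionError):
    """Raised when a skill uses a capability it was not granted."""

@dataclass(frozen=True)
class SkillPolicy:
    skill: str
    grants: frozenset = frozenset()

    def require(self, capability: str) -> None:
        # Default-deny: anything not listed in `grants` is refused.
        if capability not in self.grants:
            raise CapabilityDenied(f"{self.skill} lacks capability {capability!r}")

# Hypothetical skill that may send mail and nothing else.
mailer = SkillPolicy("send-reminder", frozenset({"mail.send"}))
mailer.require("mail.send")          # granted, returns None

try:
    mailer.require("shell.exec")     # never granted, so it is refused
except CapabilityDenied as exc:
    print(exc)
```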

5. Content-Addressed Artifacts

Trajectories, skill outputs and audit records are content-addressed. Every entry gets a hash that uniquely identifies its content. Entries are chained — tampering is detectable. This makes traces reproducible, replayable, and forensically defensible. It’s the prerequisite for being able to prove to an auditor seven years from now what an agent did and when. Without this property, audit logs are just pretty tables with no legal weight.
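The chaining idea can be shown with a short hash-chain sketch. The entry layout and function names are assumptions for illustration, not HybridClaw's log format: each entry stores the hash of its predecessor, so changing any earlier entry breaks verification.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder predecessor for the first entry

def entry_hash(prev_hash: str, payload: dict) -> str:
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append(log: list, payload: dict) -> None:
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})

def verify(log: list) -> bool:
    """Recompute every hash and check that each entry links to the previous one."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True

log = []
append(log, {"action": "tool_call", "tool": "erp.lookup"})
append(log, {"action": "set_vat_code", "value": "VAT_19"})
print(verify(log))                         # True

log[0]["payload"]["tool"] = "shell.exec"   # tamper with an early entry
print(verify(log))                         # False
```

A chain like this only detects tampering by someone who cannot also rewrite every later hash, so in practice the latest hash would be anchored somewhere outside the writer's control.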

6. Observable Everything

Every span, every tool call, every model invocation emits structured telemetry by default. Not as a plugin, not as an optional extension, not as “we’ll add it later”. Operators get dashboards for latency, cost-per-task, eval scores and safety incidents out of the box. Running an agent system without this telemetry is running on blind trust — and blind trust is not a strategy in an enterprise context.

Performance & Scaling: From the Laptop to the Worker Fleet

Scalability in agent runtimes is rarely a problem of raw request rate. An agent fleet with fifty agents running in parallel is not what makes classical web backends sweat. What makes it hard is that agents run long, make an unpredictable number of tool calls, and stream model responses rather than waiting for them. HybridClaw addresses this on five levels.

Agent-level concurrency. Every agent has its own task queue. A long-running tool call — say, a browser session waiting 90 seconds for a confirmation page — doesn’t block sibling agents. This is not a trivial pattern; many early agent frameworks had a global queue here and broke on exactly that.
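The per-agent queue pattern can be sketched with `asyncio`. This is an illustration under stated assumptions, not the runtime's scheduler: the agent names, task labels and sleep durations are invented, and `asyncio.sleep` stands in for a tool call or browser wait.

```python
import asyncio

async def agent_worker(name: str, queue: asyncio.Queue, done: list) -> None:
    # Each agent drains only its own queue, so a slow task delays
    # later tasks of the same agent but not its siblings.
    while True:
        label, seconds = await queue.get()
        await asyncio.sleep(seconds)
        done.append(f"{name}:{label}")
        queue.task_done()

async def run_demo() -> list:
    done = []
    queues = {"agent-a": asyncio.Queue(), "agent-b": asyncio.Queue()}
    workers = [asyncio.create_task(agent_worker(n, q, done)) for n, q in queues.items()]

    queues["agent-a"].put_nowait(("slow-browser-wait", 0.2))
    queues["agent-b"].put_nowait(("quick-lookup", 0.01))
    queues["agent-b"].put_nowait(("quick-summary", 0.01))

    # Wait until every queue is drained, then stop the idle workers.
    await asyncio.gather(*(q.join() for q in queues.values()))
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return done

order = asyncio.run(run_demo())
print(order)
```

Both of agent-b's tasks finish while agent-a is still waiting. With a single global queue and one consumer, they would sit behind the slow task.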

Batched LLM calls. When several agents query the same model at the same time, in-flight prompts are batched where the provider allows it. This noticeably reduces cost and latency for high-volume workloads. With 100+ requests per minute to the same model, it’s not “nice to have” — it determines the token bill at the end of the month.

Multi-layer cache. Skill outputs, retrieval results and tool responses are cached at three levels: in-memory for the active task, on-disk for the agent process, and cross-worker on explicit opt-in. The last one is deliberately not the default — cache coherence between workers is one of the most common sources of bugs in distributed agent systems.
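A rough sketch of the lookup order, assuming a plain dict as a stand-in for the cross-worker store. The `LayeredCache` class is hypothetical; the point is that the shared layer is only consulted when it is explicitly passed in.

```python
import hashlib
import json
import tempfile
from pathlib import Path

class LayeredCache:
    """Lookup order: memory, then disk, then an optional shared store."""

    def __init__(self, disk_dir, shared=None):
        self.memory = {}
        self.disk_dir = Path(disk_dir)
        self.shared = shared  # None keeps cross-worker caching switched off

    def _path(self, key):
        # Hash the key so that any string is a safe file name.
        name = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.disk_dir / f"{name}.json"

    def get(self, key):
        if key in self.memory:
            return self.memory[key], "memory"
        path = self._path(key)
        if path.exists():
            value = json.loads(path.read_text(encoding="utf-8"))
            self.memory[key] = value
            return value, "disk"
        if self.shared is not None and key in self.shared:
            value = self.shared[key]
            self.put(key, value, share=False)  # promote into the local layers
            return value, "shared"
        return None, "miss"

    def put(self, key, value, share=True):
        self.memory[key] = value
        self._path(key).write_text(json.dumps(value), encoding="utf-8")
        if share and self.shared is not None:
            self.shared[key] = value

with tempfile.TemporaryDirectory() as tmp:
    cache = LayeredCache(tmp)                      # no shared layer by default
    cache.put("retrieval:q4-revenue", {"rows": 3})
    print(cache.get("retrieval:q4-revenue")[1])    # memory
    cache.memory.clear()                           # simulate a new task in the same process
    print(cache.get("retrieval:q4-revenue")[1])    # disk
    print(cache.get("unknown-key")[1])             # miss
```

The sketch leaves out invalidation and expiry, which is where the coherence bugs mentioned above tend to come from.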

Worker scaling. Horizontal scaling happens through additional worker processes. Coordination runs through the HybridAI control plane (queue + leader election), not inside the runtime itself. This is the architectural decision that keeps the single-binary promise alive: the runtime knows nothing about other workers. It just does its job. The cluster concern lives one layer up, on purpose.

Streaming everywhere. Model outputs, tool results and channel responses stream end-to-end. There’s no waiting for a complete model completion before the next step starts. That doesn’t only reduce perceived latency in chat interfaces — it also lets downstream tools start working while the main model is still thinking.

Security: Assume Hostile Inputs

Agent platforms multiply the blast radius of every security flaw. A classical chatbot that gets jailbroken produces, at worst, embarrassing text. An agent that gets jailbroken can send emails, pay invoices, file tickets, write to databases. Anyone running agents seriously has to assume hostile inputs — both from outside (prompt injection in an incoming email) and from inside (a model that hallucinates and believes itself to be authorised).

HybridClaw stacks six security layers, all of them default-on:

Secrets vault. Tools never see raw credentials. Passwords, API keys and OAuth tokens are referenced by ID and resolved at runtime through the control plane. Audit logs and trajectories only ever contain the IDs, never the cleartext. Grepping the logs won’t yield secrets.
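A minimal sketch of reference-by-ID, assuming a `secret://` reference scheme and an in-memory stand-in for the vault. Both are hypothetical; the real resolution happens in the control plane. The audit record is written before resolution, so it can only ever contain the reference.

```python
PREFIX = "secret://"  # hypothetical reference scheme

class SecretsVault:
    """Stand-in for the control-plane vault: maps secret IDs to values."""

    def __init__(self, secrets):
        self._secrets = dict(secrets)

    def resolve(self, secret_id):
        return self._secrets[secret_id]

def execute(tool, args, vault, audit_log, transport):
    # Log first, from the unresolved arguments: the record holds IDs only.
    audit_log.append({"tool": tool, "args": dict(args)})
    resolved = {
        key: vault.resolve(value[len(PREFIX):])
        if isinstance(value, str) and value.startswith(PREFIX)
        else value
        for key, value in args.items()
    }
    # Cleartext exists only here, at the boundary to the outbound call.
    return transport(tool, resolved)

def demo_transport(tool, args):
    return f"{tool} called with {len(args)} arguments"

vault = SecretsVault({"erp-api-key": "s3cr3t-value"})
audit_log = []
result = execute(
    "erp.lookup",
    {"api_key": "secret://erp-api-key", "invoice": "2026-0042"},
    vault,
    audit_log,
    demo_transport,
)
print(result)
print(audit_log)
```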

RBAC & capability grants. Permissions are granted per agent, per skill, per tool. Default-deny: whatever isn’t explicitly allowed is forbidden. That includes seemingly innocuous things like network access.

Sandboxed execution. File access, shell commands and browser automation run in isolated contexts with no host network access by default. A skill that parses PDFs doesn’t reach the internet — unless that specific capability is explicitly granted.

Signed skill manifests. Skills are verified against signed manifests before they run. No unsigned code paths in production. This blocks supply-chain attacks via tampered skill updates — a risk that is systematically underestimated in this space.

Human-in-the-loop gates. High-impact actions — money transfers, bulk deletes, outbound mail to customers — can require human approval. Configurable per skill, audit-logged. If you want an agent to wire ten thousand euros, that should require a second click.
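An approval gate can be sketched as a policy check in front of the action. The policy keys, the action kinds and the 1,000 euro threshold are assumptions chosen for the example; the `approve` callback stands in for whatever UI collects the second click.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str
    amount_eur: float = 0.0

# Hypothetical per-skill policy: kinds that always need approval,
# plus a transfer amount above which approval is required.
POLICY = {
    "requires_approval": {"bulk_delete", "customer_mail"},
    "transfer_limit_eur": 1000.0,
}

def needs_approval(action, policy=POLICY):
    if action.kind in policy["requires_approval"]:
        return True
    return action.kind == "money_transfer" and action.amount_eur > policy["transfer_limit_eur"]

def run(action, approve, audit_log):
    if needs_approval(action):
        approved = approve(action)
        # The decision itself is logged, whether it was granted or not.
        audit_log.append({"action": action.kind, "gate": "human", "approved": approved})
        if not approved:
            return "rejected"
    return "executed"

audit_log = []
print(run(Action("money_transfer", 10_000.0), approve=lambda a: False, audit_log=audit_log))
print(run(Action("money_transfer", 50.0), approve=lambda a: False, audit_log=audit_log))
print(audit_log)
```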

Tamper-evident audit log. Every action is content-addressed and chained. Operators can prove what an agent did, when, under whose authority, with which skill version, against which input, producing which output, at what latency, at what cost. In the managed cloud, the log is additionally mirrored to an external append-only store and retained for seven years in an audit-grade format.

Resilience: Failure as the Default Case

This is one of the most important differences between demo agents and production agents: production agents fail. Models time out, tools return unexpected errors, networks partition, a vendor changes its API without notice. Anyone building agents only for the happy path is building toys.

HybridClaw treats failure as the default case, not an edge case. Concretely:

Failure Mode | Runtime Behaviour
--- | ---
Transient model error | Exponential backoff; on repeated failures, fall back to a configured alternative model.
Tool exception | Error is captured in the trajectory; agent can retry, choose a different tool, or escalate.
Worker crash | Task is requeued; idempotent skills resume from the last checkpoint. Non-idempotent skills surface a manual replay decision.
Bad skill version | Eval gate blocks deploy if regression score crosses a threshold. If it slips through anyway: rollback is a single command and content-addressed.
Dead-letter queue | Tasks that exceed the retry budget land in a DLQ for human inspection, never silently dropped.
Cost runaway | Per-agent budgets cap spend. Soft limits warn; hard limits stop new tasks until lifted manually.

The last entry is the one that tends to be missing from competing platforms — and the one that causes the most expensive damage. An agent stuck in a loop, racking up eight thousand euros in token costs over four hours, is not a theoretical scenario. We’ve seen it in real setups. Cost-runaway protection belongs on the same tier as security and audit, not in some “premium” plan.
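Two of the behaviours from the table, backoff with model fallback and per-agent budgets, are simple enough to sketch together. The function and class names, the retry count and the limits are illustrative assumptions, not the runtime's configuration.

```python
import time

class TransientModelError(Exception):
    """Stands in for a timeout or rate-limit error from a model provider."""

def call_with_fallback(prompt, models, call, retries=3, base_delay=0.01, sleep=time.sleep):
    """Try each model in order, backing off exponentially after each failure."""
    for model in models:
        for attempt in range(retries):
            try:
                return model, call(model, prompt)
            except TransientModelError:
                sleep(base_delay * 2 ** attempt)
    # Nothing worked: a caller would route the task to the dead-letter queue.
    raise RuntimeError("all configured models failed")

class Budget:
    """Per-agent spend tracker with a soft (warn) and a hard (stop) limit."""

    def __init__(self, soft_eur, hard_eur):
        self.soft_eur = soft_eur
        self.hard_eur = hard_eur
        self.spent_eur = 0.0

    def record(self, cost_eur):
        self.spent_eur += cost_eur

    def status(self):
        if self.spent_eur >= self.hard_eur:
            return "blocked"   # no new tasks until lifted manually
        if self.spent_eur >= self.soft_eur:
            return "warning"
        return "ok"

def flaky_call(model, prompt):
    # Hypothetical provider call: the primary model always times out here.
    if model == "primary-model":
        raise TransientModelError("timeout")
    return f"answer from {model}"

print(call_with_fallback("Summarise the invoice.", ["primary-model", "fallback-model"], flaky_call))

budget = Budget(soft_eur=10, hard_eur=50)
budget.record(12)
print(budget.status())   # warning
budget.record(40)
print(budget.status())   # blocked
```

Checking `budget.status()` before dequeuing each new task is what turns a four-hour loop into a stopped agent and a notification.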

Self-Hosted Runtime, Managed Control Plane — or Both

Architecturally, HybridClaw separates two layers cleanly: the runtime (open source, lightweight, self-hostable) and the control plane (managed, EU-hosted, with compliance features built in).

The runtime handles agent execution, skill manifests, tool sandboxing, browser automation, local trajectory capture, and the multi-channel adapters (Discord, Teams, WhatsApp, email, web, terminal). It’s openly available on GitHub, runs locally, and can be operated on your own infrastructure.

The control plane handles RAG and memory, the company brain, RBAC, the secrets vault, the audit log, observability dashboards, skill evals and deploy gates, budget controls, EU hosting, and GDPR/AI Act compliance. It lives on hybridai.one and is operated by us.

Both layers are independently deployable. The most common configurations we see with customers:

  • Fully managed cloud. Fastest path to production, operations handled by us, EU hosting on Hetzner.
  • Self-hosted runtime + managed control plane. Data control at the customer perimeter, but no need to build your own audit storage or eval setup.
  • Fully self-hosted. Maximum control, everything on your own infrastructure, local models via Ollama or vLLM, up to 80% lower token costs compared to cloud LLMs.
  • Hybrid deployment. Managed-cloud agents delegate sensitive tasks to self-hosted agents. Works over agent-to-agent messaging.

Important: skills and memory are portable between deployments. Starting in the managed cloud and migrating later doesn’t mean rebuilding from scratch.

Conclusion: Lightweight Is a Choice, Not a Shortcoming

When people hear “lightweight runtime”, they sometimes hear “toy” or “not enterprise-grade”. That would be a misreading. Lightweight here means: no complexity that isn’t justified by the use case. An agent doesn’t need a Kubernetes cluster to classify invoices. An agent doesn’t need twelve microservices to prepare an SAP posting. What it needs is a clean skill model, signed manifests, isolated tools, a tamper-evident audit log, and telemetry that shows whether it’s doing its job.

That is exactly what HybridClaw is. An open-source runtime that boots in under a second, ships as a single binary, and still brings every property you need to run in regulated, production-grade environments. The control plane on top is optional — but for anyone who doesn’t want to build GDPR compliance, seven-year audit retention and EU hosting themselves, it’s the substantially calmer route.

For the architecture in detail: hybridclaw.io/architecture. To get hands-on: GitHub for self-hosted, hybridai.one for the managed cloud. There’s a live demo at re:publica 26 in Berlin, where we’ll have a HybridClaw agent write a book on stage. Let’s see how far it gets.


Related reading:

OpenClaw Alternatives for Enterprises (2026)

Let’s be honest: OpenClaw changed everything. When it first shipped, the idea of a single AI assistant that lives across all your messaging channels felt like science fiction. Today it’s table stakes. But as more teams move from “let’s try this” to “let’s deploy this for real,” the cracks in the original are showing — and a new generation of claw-like agents is stepping in to fill the gaps.

We’ve been building in this space for a while now, and we talk to enterprise teams every week who are evaluating their options. Here’s what the landscape actually looks like in 2026, and why we think it matters.

The Original: OpenClaw

You can’t have this conversation without starting here. OpenClaw is the most mature project in the space — 23+ channel adapters, a skills marketplace (ClawHub), voice wake mode, browser control, cron scheduling, Canvas with A2UI. It’s the kitchen sink, and for many teams, that’s exactly what they want.

But enterprise teams keep running into the same friction points. The ~500 MB memory footprint and 6-second startup feel heavy when you’re deploying hundreds of instances. The security model is application-level — permission checks in code, not actual OS isolation. And at ~400 source files with 53 configuration surfaces, onboarding new engineers takes longer than anyone admits.

If your team has the ops muscle and wants maximum channel coverage out of the box, OpenClaw is still a defensible choice. But if you’re evaluating with fresh eyes, keep reading.

NanoClaw: The Minimalist Thesis

NanoClaw took the opposite bet: what if the entire codebase was small enough that one engineer could understand it in an afternoon? It’s a single Node.js process, a handful of files, and true container isolation — not permission checks, but actual Docker or Apple Container boundaries per group.

The Agent Swarms feature is genuinely novel. NanoClaw was the first personal assistant framework to ship it, and it unlocks parallel task execution patterns that larger tools are still catching up on. The trade-off is ecosystem breadth — fewer channels, fewer integrations, and customization means changing code, not toggling config flags. For teams that want deep control and can live with five solid channels instead of twenty-three, NanoClaw punches well above its weight.

NullClaw: The Edge Play

This one is wild. NullClaw compiles to a 678 KB static Zig binary, runs in under 1 MB of RAM, and cold-starts in less than 2 milliseconds. On paper, those numbers shouldn’t be possible for something that supports 50+ LLM providers and 19 channels.

The architecture is vtable-driven — every component (providers, channels, tools, memory) is a swappable interface. You can compile exactly the feature set you need. For edge deployments, embedded devices, or that $5 ARM board sitting in a closet, nothing else comes close. The catch is Zig itself: smaller talent pool, steeper onboarding, and the project is still pre-1.0. Enterprise teams running fleets of lightweight agents on constrained hardware should absolutely evaluate this. Everyone else can admire it from a distance.

OpenFang: The Security-First Heavyweight

If NullClaw is the lightweight champion, OpenFang is the enterprise heavyweight. Written in Rust across 14 crates and 137K lines of code, it positions itself as an “Agent Operating System” — and the framing is earned. Sixteen distinct security layers including Merkle audit trails, taint tracking, Ed25519 signing, and SSRF protection. WASM-metered sandboxing for tool execution. Forty channel adapters.

The killer feature is Hands — pre-built autonomous capability packages for lead generation, OSINT collection, video processing, research, and social media management. These run 24/7 without user prompting, which is exactly what enterprise operations teams want. Cold start under 200 ms, 40 MB memory, 32 MB binary. The downside: it’s Rust, it’s complex, and it’s v0.3.30. But for organizations where audit trail and security posture are non-negotiable, OpenFang is the answer today.

CoPaw: The APAC Bridge

CoPaw comes from AgentScope and targets a gap that Western-built tools consistently miss: first-class support for DingTalk, Feishu, and QQ alongside the usual Discord/Slack/Telegram stack. Desktop installers for Windows and macOS, a web console for configuration, and Python-based skill authoring make it the most accessible option for non-developer users.

For enterprises with significant APAC operations — especially teams in China — CoPaw solves a localization problem that no other tool in this list even attempts seriously. The trade-off is the usual Python story: heavier runtime, less security hardening, more cloud dependency.

Hermes Agent: The Learning Loop

Nous Research’s Hermes Agent is the only project here that genuinely improves itself during use. It creates skills from experience, refines them over time, searches past conversations, and builds persistent user profiles. The closed learning loop — where the agent curates its own memory with periodic nudges — is architecturally distinct from everything else in this space.

For research teams and organizations betting on long-horizon agent deployments where accumulated knowledge is the moat, Hermes is uniquely compelling. It’s also the most research-oriented tool here, with built-in support for trajectory generation and RL environments. Less polished for day-one enterprise deployment, but the trajectory (pun intended) is clear.

HybridClaw: Where We Landed

We built HybridClaw because we kept seeing the same gap. Enterprise teams wanted OpenClaw’s feature depth but with actual security isolation, EU-stack compatibility, and GDPR-aligned data handling — without the ops burden of running Rust or Zig in production.

HybridClaw runs as a Node.js gateway with Docker-sandboxed tool execution. RAG-powered retrieval with document-grounded responses. Structured audit trails with hash-chain verification. Bundled office skills — PDF, XLSX, DOCX, PPTX — that handle the kind of document workflows enterprises actually need, not just chat. MCP integration for extensibility. Local model support via LM Studio, Ollama, or vLLM for air-gapped deployments. A built-in admin console with dashboard, session management, model configuration, and audit views.

What we think makes the difference: HybridClaw treats security and compliance as first-class architectural decisions, not afterthoughts bolted on top. Container isolation by default. Credentials separated from config. An onboarding flow that requires explicit trust model acceptance before anything runs. And all of it in TypeScript — which means your team can actually audit, extend, and maintain it without hiring Zig or Rust specialists.

Is it the smallest? No, that’s NullClaw. The most channels? No, OpenFang. The most autonomous? Hermes has that covered. But for European enterprises that need a production-ready agent with real security, document workflows, and a codebase their existing team can own — that’s the gap we built for.

The Takeaway

The “claw-like agent” space in 2026 is no longer a one-horse race. OpenClaw set the template, but the next generation is fragmenting along clear lines: minimalism (NanoClaw, NullClaw), security-first (OpenFang, HybridClaw), regional fit (CoPaw), and self-improvement (Hermes). The right choice depends on your constraints — not on who shipped first.

Pick the tool that matches your actual deployment reality, not the one with the longest feature list.

Beyond ChatGPT: Why HybridClaw Redefines the Rules for Enterprises

1. Introduction: The AI Productivity Paradox

In today’s business landscape, we observe a critical paradox: while individual employees achieve impressive efficiency gains using isolated tools like ChatGPT, the overall systemic performance of organizations remains stagnant. The result is an uncontrolled shadow IT environment that not only poses a significant risk to data sovereignty but also isolates valuable knowledge within private chat windows.

We are leaving the era of mere “AI experimentation” behind and entering the phase of industrial-grade AI. The decisive step moves away from pure personal productivity—as offered by OpenClaw (formerly MoltBot) in private use—toward a resilient enterprise infrastructure powered by HybridAI. Companies that still rely on simple chatbots are missing out on the potential of scalable operational excellence.

So why is a standard interface no longer sufficient for businesses? The answer lies in the transformation from isolated text generation to orchestrated, secure, and collective intelligence.


2. Takeaway 1: The Power of Orchestration – One Brain, Many Experts

A major obstacle to strategic scaling is dependence on a single provider. HybridAI solves this through multi-model orchestration. Instead of relying on a single model, the system leverages a portfolio of more than 10 leading LLMs—including GPT-5, Claude, Gemini 3, Mistral, DeepSeek, and specialized models like Nano-Banana for image generation.

At its core is Intelligent Task Routing: the system autonomously recognizes the nature of a task and delegates it to the most capable expert. While Claude designs complex coding structures, GPT-5 handles deep research, Gemini 3 excels at high-speed queries, and Nano-Banana visualizes concepts.

Strategically, this ensures full vendor independence: companies are no longer tied to the fate of a single provider but remain flexible and future-proof.

“Multi-model orchestration, shared RAG for departments, specialized tools, and maximum data security. Give your teams the most powerful AI assistance with full control.”


3. Takeaway 2: From Individuals to Collective Intelligence – Shared RAG and Team Memory

In traditional setups, AI acts as a “lone operator,” starting from zero with every interaction. HybridAI transforms this into a long-term digital memory for the entire organization. Knowledge transfer is revolutionized through two mechanisms:

  • Shared RAG (Retrieval-Augmented Generation): Departments or project teams work with shared knowledge bases. SOPs, specific guidelines, and technical documentation are centrally provided, allowing AI to respond based on verified company data.
  • Team Memory: The AI continuously learns from interactions across the entire collective. This shared intelligence ensures that valuable insights are not lost when a browser tab is closed but persist as a strategic asset for the entire department.

4. Takeaway 3: Sovereignty by Default – The EU Stack Guarantee

For compliance leaders, data security is the foundation of any AI strategy. HybridAI offers an architecture where sovereignty is not an option—it is the default.

The system is fully GDPR-compliant, entirely hosted in the EU, and already aligned with the upcoming AI Act.

For organizations with the highest security requirements, self-hosting via vLLM on their own infrastructure is also available. A key technical enabler is the multi-layer security concept: an integrated filter detects sensitive data (PII – Personally Identifiable Information) and automatically masks it before processing.

Combined with a complete audit trail that documents every action, the system meets even the most stringent regulatory requirements.

“Maximum AI power with maximum security”


5. Takeaway 4: Agents Instead of Chatbots – Autonomous Tool Usage

The paradigm shift of HybridAI lies in its ability to act. We are no longer talking about simple text generators, but about AI agents that can autonomously access browsers, APIs, databases, and ERP systems. These agents don’t just perform tasks—they complete them.

Specialized tools unlock their full potential, particularly in professional departments:

  • BI Service: Enables business intelligence via text-to-SQL. Employees can perform complex data queries in natural language without needing SQL knowledge.
  • Tax Classification: Automates highly specific processes such as assigning VAT codes, Incoterms 2020, and HS codes for global trade.
  • Coding Agent: Supports software teams with automated code reviews and testing directly within workflows—optionally secured with locally hosted models for maximum IP protection.

6. Takeaway 5: The Visual Identity of Hybrid AI

The HybridClaw logo is a visual metaphor for this new form of collaboration. The organic lines of the jellyfish represent fluid human intelligence and adaptability. These merge with mechanical claws and circuitry, symbolizing machine precision and connectivity.

A central element is the data cubes, manipulated by the mechanical claws. They represent the raw informational building blocks of an enterprise. The symbolism is clear: AI serves as a precise tool to structure unorganized data and transform it into valuable knowledge through the organization’s organic intelligence.

It is the perfect fusion of biological adaptability and technological power.


7. Conclusion: The Future of the Augmented Workforce

HybridClaw and the HybridAI infrastructure mark the end of isolated AI experimentation. This is not just another tool in the software stack; it is the operating system for the collective intelligence of modern enterprises.

In a world where information is the most valuable resource, the quality of orchestration determines market success.

Is your company still stuck in “chat mode,” or are you already leveraging the full potential of an orchestrated AI workforce for your strategic transformation?

Try it now: https://hybridclaw.io

How Can I Enhance My BI Data with AI?

Over the past few months, we’ve witnessed a real boom in AI applications – from ChatGPT to Copilot and specialized enterprise solutions. A question that keeps coming up: How can I intelligently connect my existing Business Intelligence data with AI?

The answer is easier said than done. While AI excels at handling unstructured text, structured databases present a unique challenge.

The Problem: Structured Data Meets AI

Why RAG Isn’t the Solution

The classic RAG approach (Retrieval Augmented Generation) works brilliantly for documents, PDFs, or knowledge bases. Text is converted into vectors and searched semantically.

But: BI data in SQL databases is structured. It thrives on:

  • Precise aggregations (SUM, AVG, COUNT)
  • Complex JOINs across multiple tables
  • WHERE conditions with exact values
  • GROUP BY for groupings

A vector search over your revenue table will never match the precision of a SQL query. RAG is simply the wrong tool here.

Why Simple SQL Tool Calls Are Too Limited

The next thought: “Let’s just give the LLM an SQL tool!”

The problem with that:

  • Lack of context continuity: For every question, the model must re-understand the entire schema context
  • No iteration: Complex analyses require multiple cascading queries
  • Reasoning overhead: The conversational model must simultaneously write SQL AND provide clever answers
  • Prompt collision: SQL syntax and natural conversation compete for context space

A model that’s supposed to do both – SQL and conversation – won’t do either particularly well.

The Solution: A Two-Layer Approach

The elegant solution lies in specialization through layering. Instead of having one model do everything, we split the work between two specialized LLMs:

Layer 1: Text2SQL – The Data Translator

The first LLM has a single task: convert textual queries into precise SQL.

Benefits of this specialization:

  • Focuses solely on schema understanding and SQL syntax
  • Can be fed with extensive database context
  • No “distraction” from conversational requirements
  • Smaller, faster model possible (e.g., GPT-3.5, Claude Haiku)

Workflow:

  1. User asks: “How much revenue did we make in Q4?”
  2. Text2SQL LLM translates to: SELECT SUM(revenue) FROM sales WHERE quarter = 4
  3. Query is executed, data comes back
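The three steps above can be sketched end to end with an in-memory SQLite database. The `text_to_sql` function is a stub standing in for the Text2SQL model, and the `sales` table with its sample rows is invented for the example.

```python
import sqlite3

def text_to_sql(question: str) -> str:
    # Stub for the Text2SQL layer; a real one would call an LLM with schema context.
    canned = {
        "How much revenue did we make in Q4?": "SELECT SUM(revenue) FROM sales WHERE quarter = 4",
    }
    return canned[question]

# Hypothetical schema and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quarter INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(3, 1_000_000.0), (3, 800_000.0), (4, 1_500_000.0), (4, 600_000.0)],
)

sql = text_to_sql("How much revenue did we make in Q4?")   # step 2: translate
(total,) = conn.execute(sql).fetchone()                    # step 3: execute
print(sql, "->", total)
```

On a table that spans several years, the query as written would sum every fourth quarter, so a production schema would also need a year filter. Generated SQL should run under a read-only database role.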

Layer 2: Reasoning LLM – The Conversation Partner

The second, higher-level LLM is your intelligent analyst. It:

  • Conducts the conversation with the user
  • Decides what data is needed
  • Calls the Text2SQL LLM as a tool
  • Interprets the data and draws conclusions
  • Asks follow-up questions and conducts multi-turn analyses

Example Dialog:

User: “How is our revenue developing?”

Reasoning LLM thinks:
→ Need revenue data for recent quarters
→ Calls Text2SQL: “Revenue per quarter last 12 months”
→ Receives data: Q1: 1.2M, Q2: 1.5M, Q3: 1.8M, Q4: 2.1M

Reasoning LLM responds:
“Your revenue shows a clear upward trend with +75% growth over the last 4 quarters. Q4 was the strongest quarter at 2.1M, up about 17% on Q3.”

User: “What’s driving that?”

Reasoning LLM thinks:
→ Need breakdown by product category Q4
→ Calls Text2SQL
→ Analyzes and responds: “The main drivers were…”
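The growth figures in the dialog follow directly from the four quarterly values and can be checked in a few lines. The helper name is ours; the numbers are the ones from the example.

```python
revenue_m = {"Q1": 1.2, "Q2": 1.5, "Q3": 1.8, "Q4": 2.1}  # revenue in millions

def growth_pct(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

overall = growth_pct(revenue_m["Q1"], revenue_m["Q4"])
q4_vs_q3 = growth_pct(revenue_m["Q3"], revenue_m["Q4"])

print(f"Q1 to Q4: {overall:+.1f}%")    # +75.0%
print(f"Q3 to Q4: {q4_vs_q3:+.1f}%")   # +16.7%
```

Each quarter adds the same 0.3M in absolute terms, so the percentage growth rate falls from quarter to quarter even though revenue keeps rising. A reasoning layer should surface that distinction rather than gloss over it.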

Why This Approach Is Superior

1. Separation of Concerns

Each LLM does what it does best:

  • Text2SQL: Precise SQL generation
  • Reasoning: Intelligent analysis and conversation

2. Better Performance

  • Smaller, faster models possible for Text2SQL
  • Less context-switching
  • Parallel optimization of both layers

3. Higher Quality

  • Text2SQL can be trained with detailed schema knowledge
  • Reasoning LLM focuses on insights, not syntax
  • Less “prompt pollution”

4. Easier Maintenance

  • Schema changes? Only adjust Text2SQL
  • Improve conversational style? Only adjust reasoning prompts
  • Clear responsibilities

5. Better Error Handling

  • SQL errors can be caught by the Text2SQL layer
  • Reasoning LLM can ask alternative questions
  • Graceful degradation possible
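Catching SQL errors inside the Text2SQL layer can be as simple as returning a structured result instead of raising. This is a sketch with an invented `sales` table; the result shape is an assumption about what a reasoning layer would find convenient.

```python
import sqlite3

def run_query(conn, sql):
    """Execute SQL and return a result dict instead of raising."""
    try:
        return {"ok": True, "rows": conn.execute(sql).fetchall()}
    except sqlite3.Error as exc:
        # The reasoning layer can use the message to rephrase or retry.
        return {"ok": False, "error": str(exc)}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quarter INTEGER, revenue REAL)")
conn.execute("INSERT INTO sales VALUES (4, 2100000.0)")

good = run_query(conn, "SELECT SUM(revenue) FROM sales WHERE quarter = 4")
bad = run_query(conn, "SELECT SUM(revenu) FROM sales")   # misspelled column

print(good)
print(bad)
```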

Implementation in Practice

At HybridAI, we implement exactly this approach for our clients:

  1. Text2SQL Layer: A specialized model familiarized with your database schema
  2. Reasoning Layer: Claude or GPT-4 for natural conversations about your data
  3. Security: Row-level security and access control at DB level
  4. Caching: Frequent queries are cached for faster responses

The result: Your employees can speak with your BI data in natural language – precisely, quickly, and intelligently.

Conclusion

Connecting AI with structured BI data is not a trivial task. Neither RAG nor simple SQL tools are sufficient.

The solution lies in intelligent division of labor: A specialized Text2SQL LLM translates queries into precise SQL, while a higher-level Reasoning LLM conducts the conversation and generates insights.

This two-layer approach combines the best of both worlds: The precision of structured queries with the flexibility of natural conversation.


Want to enhance your BI data with AI? At HybridAI, we support you in implementing intelligent data analysis solutions. Contact us for a non-binding conversation.

AI in Accounting: What Actually Works in 2026

Everyone’s talking about AI in accounting now. The consultants have their slides ready. The vendors are rebranding their invoice OCR tools. LinkedIn is full of posts about “How I automated my finance tasks – comment FINANCE to get my n8n workflow.” Wrappers everywhere.

Here’s my honest take after working on the technical AI side of this stuff for a while: AI will absolutely change accounting. But not in the way most people think, and probably not as fast as the headlines suggest.

The terminology mess

Let’s start with the basics, because the language around this is a disaster. When people say “AI in accounting,” they could mean any of the following:

Rule-based automation: If invoice amount > 10,000, route to CFO. If “Intra-community supply under § 4 1b”, then “VAT_00”. Not AI. Just code with a marketing budget.

Machine learning classifiers: Trained models that categorize transactions based on patterns. These are actually useful and have been around for years. But they often generalize poorly and are hard to keep up to date, because many of them are black boxes.

OCR and document extraction: Reading invoices and pulling out vendor names, amounts, dates. This is common practice now. And in some instances it actually works.

Large Language Models: Our favorite brand new toy. GPT, Claude, Gemini. Can understand context, interpret messy inputs, handle edge cases. But also: can hallucinate numbers with absolute confidence.

Most “AI accounting” products today are really just ML classifiers with an LLM-powered chatbot stapled on top. Which is fine, but let’s be honest about what we’re dealing with.

Where AI already works today

Behind all the marketing talk, there are real wins happening right now:

Document capture and extraction. Modern systems can read invoices in any format, any language, any level of scan quality. The combination of vision models and LLMs has basically solved this problem. You still need human review for edge cases, but 80-90% straight-through processing is achievable. And if you get an invoice in a new format, it still works. Because AI.

Transaction categorization. For standard cases, ML models are excellent at learning your chart of accounts and applying it consistently. They don’t get tired on Friday afternoons. They don’t have “creative” interpretations of cost centers. Also: don’t forget that AI doesn’t necessarily mean LLMs. We also have an awesome family of sub-1B-parameter, encoder-only models that are wonderful at classification.

Anomaly detection. Spotting duplicate invoices, unusual amounts, vendors that suddenly changed bank details. Pattern recognition at scale is exactly what ML does well. This is genuinely useful for fraud prevention and audit prep.
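
To give a feel for the kind of checks involved, here is a deterministic baseline for two of the cases above. It is plain rule-based matching rather than a learned model, and the field names and sample values are made up for illustration.

```python
from collections import defaultdict

def find_duplicates(invoices):
    """Group invoice ids that share vendor, invoice number and amount."""
    seen = defaultdict(list)
    for inv in invoices:
        seen[(inv["vendor"], inv["invoice_no"], inv["amount"])].append(inv["id"])
    return [ids for ids in seen.values() if len(ids) > 1]

def find_changed_bank_details(invoices, known_ibans):
    """Return ids of invoices whose IBAN differs from the one on file for that vendor."""
    return [inv["id"] for inv in invoices
            if inv["vendor"] in known_ibans and inv["iban"] != known_ibans[inv["vendor"]]]

# Placeholder data; the IBAN strings are dummies, not real account numbers.
known_ibans = {"Acme GmbH": "IBAN-A", "Beta AG": "IBAN-B"}
invoices = [
    {"id": 1, "vendor": "Acme GmbH", "invoice_no": "2026-001", "amount": 1190.0, "iban": "IBAN-A"},
    {"id": 2, "vendor": "Beta AG", "invoice_no": "77", "amount": 500.0, "iban": "IBAN-X"},
    {"id": 3, "vendor": "Acme GmbH", "invoice_no": "2026-001", "amount": 1190.0, "iban": "IBAN-A"},
]

duplicates = find_duplicates(invoices)
changed = find_changed_bank_details(invoices, known_ibans)
```

A trained model earns its keep on the fuzzy cases this baseline misses, such as near-duplicates with slightly different invoice numbers or amounts that are unusual for a given vendor.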

Natural language queries. “Show me all marketing expenses over 50k last quarter” without writing SQL. This works now. It’s not magic, but it looks like magic and really saves time. Why not chat with your business data 😉

The common thread? These are all tasks where being approximately right most of the time is valuable, and where humans can easily verify the output.

Where things get interesting (and dangerous)

Now for the hard part.

The moment you need AI to make a decision that has legal or tax implications, everything changes. Consider VAT determination on an incoming invoice. Sounds simple: it’s 19%, right?

Except when it’s not. Is the supplier in another EU country? Is this a service or a good? Does reverse charge apply? Is it construction-related (§13b in Germany)? Is the supplier even VAT-registered? Is it a triangular trade? Is there a pandemic with special VAT rates?

I’ve written about this specific problem with tax codes in SAP before. The short version: there are dozens of edge cases, and getting it wrong means audit findings, back taxes, and possibly fraud allegations.

Here’s the uncomfortable truth: LLMs are very good at explaining what reverse charge is. In academic voice or as a sonnet. But they’re dangerously unreliable at determining whether a specific invoice should use it. The difference matters.

The hallucination problem is real. An LLM will confidently tell you that this invoice clearly qualifies for intra-community supply treatment. It might even cite the relevant EU directive. It might also be completely wrong, because it didn’t notice the supplier has a German VAT ID, or because the goods never actually left the country. I ran a couple of examples through different LLMs – and they were very opinionated about certain things. But not necessarily right. So right now, we’re creating a VATBench to get a better view of this.

When Claude or GPT makes a mistake in a creative writing task, you get a weird sentence. When it makes a mistake in tax determination, you get a six-figure assessment in your next audit.

The hybrid AI architecture that actually works

So where does this leave us? Not with “AI bad, humans good.” The answer is architectural. A pattern that really works magic combines three things:

LLMs for interpretation. Let the language model read the invoice, extract the relevant facts, classify the transaction type, identify the supplier’s jurisdiction. This is what they’re good at – information extraction!

Structured rules for decisions. Tax law is not creative. It’s a decision tree with many branches but clear logic. Once you have the facts, applying the rules should be deterministic. No creativity needed. No hallucination possible.

Transparent audit trails. Every decision needs to document why it was made. Which invoice fields were extracted. How the supplier was classified. Which rule determined the tax code. When the auditor asks, you need answers.

The key insight: don’t ask the LLM what the tax code should be. Ask it to extract the facts, then apply your rules. It’s not half as sexy as “our AI automatically handles everything.” But it works.
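
Here is a stripped-down sketch of that pattern. The fact fields, the rule set and the tax codes are hypothetical placeholders and far too simple for real tax determination; the point is only the shape: extracted facts go in, a deterministic rule decides, and the audit record says which rule fired.

```python
from dataclasses import dataclass, asdict

EU_SAMPLE = {"DE", "FR", "NL", "AT"}  # truncated sample, not the full EU list

@dataclass(frozen=True)
class InvoiceFacts:
    """Facts the LLM is asked to extract; field names are hypothetical."""
    supplier_country: str
    supplier_has_vat_id: bool
    is_service: bool

# (rule id, condition, tax code); first match wins. Toy rules, placeholder codes.
RULES = [
    ("R1_domestic", lambda f: f.supplier_country == "DE", "VAT_19"),
    ("R2_eu_service_reverse_charge",
     lambda f: f.supplier_country in EU_SAMPLE and f.is_service and f.supplier_has_vat_id,
     "RC_EU"),
]

def determine_tax_code(facts: InvoiceFacts) -> dict:
    """Apply the rules deterministically and return an audit record."""
    for rule_id, condition, code in RULES:
        if condition(facts):
            return {"facts": asdict(facts), "rule": rule_id, "tax_code": code}
    # No rule matched: hand over to a human instead of guessing.
    return {"facts": asdict(facts), "rule": None, "tax_code": "MANUAL_REVIEW"}

record = determine_tax_code(InvoiceFacts("FR", supplier_has_vat_id=True, is_service=True))
```

The LLM only ever fills in `InvoiceFacts`. If it extracts a fact wrongly, the audit record shows exactly which input led to which rule, and that can be checked against the invoice.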

What this means for CFO offices and finance teams

A few practical conclusions:

You’re not getting replaced. The “AI will automate away accounting” takes are mostly written by people who’ve never closed a month-end.

Your job is changing. Less data entry, more oversight. Less manual matching, more exception handling. Less typing, more thinking. If you’re spending 60% of your time on tasks that could be automated, you should definitely talk AI.

You need to understand the tools. Not how to build an LLM from scratch (although that is super fun to do). But how they work, where they fail, what they can and can’t do. The finance leaders who thrive will be the ones who can evaluate AI vendors with real technical understanding.

Start with contained problems. Don’t try to “AI-enable the entire finance function.” Pick one painful process with clear success criteria. Invoice capture. Expense categorization. Intercompany matching. Get that working, learn from it, then expand.

The bottom line on AI in accounting

AI in accounting is real, useful, and overhyped all at the same time. The technology works for information extraction, pattern matching, and natural language interfaces. It doesn’t work—not safely—for unsupervised decision-making on anything with legal consequences.

The winning approach combines the interpretive power of LLMs with the precision of rule-based systems and the oversight of human experts. It’s less exciting than “fully autonomous AI accounting” but it’s what actually ships, actually works, and actually survives audits.

Evaluation-Set for every Customer

Today we launched a new feature in the Prompt-Tuning-Clinic – the “Evaluation Criteria” Section.

One of the most annoying things in AI is figuring out whether a custom-configured AI (chatbot, agent, automation) is doing well or not. In most cases, both suppliers and customers treat the question like this:

“Yesterday I ran this prompt against it, and it looked really good, good progress!” – or – “My boss asked it to do x and it gave a totally wrong answer, we have to redo the whole thing!”

To some extent this is an inherent problem of AI. For one, these systems are universally capable: you can ask practically anything and will always get an answer. And because they are non-deterministic, it is very hard to pin down what a system does and what it does not.

We were a bit tired of this, so we thought: why do we read LLMarena (btw, we recently launched a German LLM-Arena, try it here) and other rankings of new AI models every second day, yet don’t apply similar mechanisms to our customer installations?

This is exactly what this new feature brings:

  • Define a couple of test prompts (you can upload reference material such as your API documentation or an md file of the website and let the AI propose test prompts)
  • Run these prompts against the current configuration of the bot
  • Evaluate them (can also be done with an LLM automatically)
  • Define correct answers for edge cases
  • Save those prompts that are important permanently
  • Give them thumbs up/down to create cases for Fine-Tuning and DSPy
  • Run them all to get a quality ranking
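
Conceptually, the loop behind these steps is small. The sketch below uses a simple substring check as the evaluator and a stub instead of a real bot; the names `EvalCase`, `run_eval` and `demo_bot` are made up for illustration and are not the feature's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    prompt: str
    must_contain: list = field(default_factory=list)  # expected fragments of a correct answer

def run_eval(bot, cases):
    """Run every saved prompt against the bot and return (score, per-case results)."""
    results = []
    for case in cases:
        answer = bot(case.prompt)
        passed = all(s.lower() in answer.lower() for s in case.must_contain)
        results.append((case.prompt, passed))
    score = sum(passed for _, passed in results) / len(results) if results else 0.0
    return score, results

def demo_bot(prompt):
    """Stand-in for the currently configured bot."""
    answers = {"What are your opening hours?": "We are open 9:00 to 17:00, Monday to Friday."}
    return answers.get(prompt, "I am not sure.")

cases = [
    EvalCase("What are your opening hours?", ["9:00", "17:00"]),
    EvalCase("Do you ship to Austria?", ["yes"]),
]
score, results = run_eval(demo_bot, cases)
```

Swapping the substring check for an LLM-based judge changes only the line that computes `passed`; the rest of the loop, and the score you compare across configurations, stays the same.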

Once this is set up, the game changes drastically, because now we (both supplier and customer) have a well-defined test set of intended behavior that can be run automatically.

This is useful not only for the initial setup of a system, but also for improvements, model updates, new settings, etc.

And: since we also offer fine-tuning for our models and have integrated DSPy as an automated prompt-tuning tool, you can create training data for both while building your evaluation set. A simple thumbs up/down on an answer creates an entry in the test database for later.
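
As a rough idea of what such an entry could look like, the sketch below turns rated answers into JSONL lines. The record layout is an assumption for illustration; it is not the format our database, the fine-tuning pipeline or DSPy actually uses.

```python
import json

def feedback_to_jsonl(feedback):
    """Turn rated answers into one JSON line each (hypothetical record layout)."""
    lines = []
    for item in feedback:
        record = {
            "prompt": item["prompt"],
            "answer": item["answer"],
            "label": "good" if item["thumbs_up"] else "bad",
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)

feedback = [
    {"prompt": "What are your opening hours?", "answer": "9:00 to 17:00.", "thumbs_up": True},
    {"prompt": "Do you ship to Austria?", "answer": "I am not sure.", "thumbs_up": False},
]
jsonl = feedback_to_jsonl(feedback)
```

Answers labeled "good" can serve as positive examples later, while the "bad" ones mark the cases a tuned prompt or model needs to fix.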

Sign up for a free account and try it out!

Business Intelligence in the AI Era in 2026: Opportunities, Risks, and the Architecture Behind It

Let’s be honest: Does your company have all business-relevant information available at the push of a button? Or is it also stuck in various data silos, largely unconnected – the ERP here, the CRM there, plus Excel spreadsheets on personal drives and strategy documents somewhere in the cloud?

If you’re nodding right now, you’re in good company. I regularly speak with CEOs and finance leaders, and the picture is almost always the same: The data would be there. But bringing it together to answer a specific question takes days – if anyone can do it at all.

Why This Is Becoming a Problem Right Now

The days when companies could rely on stable markets and predictable developments are over. Inflation, geopolitical tensions, disrupted supply chains, a labor market in flux – all of this demands a new discipline: Decisions must not only be good, they must be good fast.

Traditional business intelligence has a proven answer to this: dashboards, KPIs, monthly reports. But let’s be honest – these tools hit their limits as soon as questions get more complex. What happens to our margin if we switch suppliers? How does a price increase affect different customer segments? What scenarios emerge if the euro keeps falling?

Questions like these need more than static charts. They need a real conversation with your own data.

The Temptation: An AI Sparring Partner for Your Decisions

This is exactly where generative AI gets really exciting. The idea is compelling: An intelligent assistant that knows your company’s numbers, understands connections, and lets you explore strategic options – anytime, without scheduling, without someone having to build an analysis first.

“How did our top 10 customers develop last quarter?” “What if we reduced the product portfolio by 20%?” “Compare our cost structure with last year and show me the biggest outliers.”

A dialogue like this would democratize business intelligence. Not just the controller with their Excel expertise would have access to insights – every decision-maker could query the data themselves. I still find this idea fascinating.

The Problem: When AI Hallucinates, It Gets Really Expensive

But – and this is a big but – here’s the crux. Large Language Models are impressive at generating plausible-sounding answers. They’re considerably less reliable at delivering factually correct answers. Especially when it comes to concrete numbers.

An AI that misremembers a date in a creative text? Annoying, but manageable. An AI that invents a revenue figure or miscalculates a margin during a business decision? That can really hurt. The danger multiplies because the answers are so damn convincing. We humans tend to trust a confidently delivered statement – even when it comes from a statistical language model.

I say this from experience: A naive integration of ChatGPT with company data is a risk, not progress. Anyone who sees it differently has either been lucky or hasn’t noticed yet.

The Technical Challenge: Connecting Three Worlds

The solution lies in a well-thought-out architecture that intelligently brings together three different data sources:

Structured data via SQL: The hard facts – revenues, costs, quantities, customer histories – typically reside in relational databases. Here, the AI must not guess but query precisely. The system must generate SQL queries, execute them, and correctly interpret the results. No room for creativity.

Unstructured data via RAG: Beyond the numbers, there’s context – strategy papers, market analyses, internal guidelines, meeting notes. These documents can be accessed through Retrieval Augmented Generation: The system searches for relevant text passages and provides them to the language model as context.

The model’s world knowledge: Finally, the LLM brings its own knowledge – about industries, economic relationships, best practices. This knowledge is valuable for interpretation, but dangerous when mixed with concrete company figures.

The art lies in cleanly separating these three sources and making transparent where each piece of information comes from.

The Solution: Everything into the Context Window

Modern LLMs offer context windows of 100,000 tokens and more. This opens up an elegant architectural approach: Instead of letting the model guess which data might be relevant, we proactively load all needed information into the context.

A well-designed system works in several steps: It analyzes the user’s question and identifies relevant data sources. Then it executes the necessary SQL queries. In parallel, it searches the document base via RAG. And finally, the LLM receives all this information served up together – with clear labeling of sources.

The language model thus becomes an interpreter and communicator, not a fact generator. It can explain numbers, reveal connections, ask follow-up questions, discuss options for action – but it doesn’t invent data, because the real data is already in the context.
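
A minimal sketch of the final assembly step, assuming the SQL queries and the RAG search have already run. The tag format (`[SQL:…]`, `[DOC:…]`, `[MODEL]`) and the function name are illustrative choices, not a fixed convention.

```python
def build_context(question, sql_results, doc_passages):
    """Assemble one prompt in which every piece of data carries a source label."""
    lines = [
        "Answer using only the labeled material below.",
        "Cite [SQL:...] or [DOC:...] for every figure; mark your own assessments as [MODEL].",
        "",
    ]
    for name, rows in sql_results.items():
        lines.append(f"[SQL:{name}]")
        lines.extend(", ".join(str(value) for value in row) for row in rows)
    for title, text in doc_passages:
        lines.append(f"[DOC:{title}]")
        lines.append(text)
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

context = build_context(
    "How did our top customers develop last quarter?",
    {"top_customers": [("Customer A", 410000), ("Customer B", 385000)]},  # made-up rows
    [("Strategy 2026", "Focus on key accounts in the DACH region.")],     # made-up passage
)
```

Because the labels travel with the data, the model can be asked to repeat them in its answer, which is what makes the source of each statement traceable afterwards.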

Transparency as a Design Principle

Such a system must build transparency into its DNA. Every statement about concrete numbers should cite its source. The user must be able to trace: Does this come from the database? Was it quoted from a document? Or is it an assessment by the model?

This transparency isn’t just a technical feature – it’s the prerequisite for trust. Anyone basing business decisions on AI-supported analyses must know what they’re relying on.

The Path Forward

Business intelligence with AI is neither utopia nor hype – it’s an architecture challenge. The technology is mature, the models are powerful, the interfaces exist. What many companies lack is a thoughtful approach that leverages the strengths of LLMs without falling prey to their weaknesses.

The future belongs to systems that intelligently connect structured databases, document knowledge, and language models – while always making transparent what is fact and what is interpretation. Companies that find this balance gain more than just another analytics tool. They gain a real sparring partner for better decisions in difficult times.

And yes – that’s exactly what we’re working on.

The Lean Revolution: Why Small Language Models will dominate 2026

Faster, cheaper, more controllable – and still powerful: Small Language Models are conquering the enterprise space.

While the world obsesses over GPT-5 and ever-larger models, something exciting is happening in the background: Small Language Models (SLMs) are evolving rapidly and becoming a real alternative for enterprise applications. In our latest webinar, we showed why – and demonstrated our own fine-tuned models live.

The Problem with the Big Ones

80-95% of all corporate AI projects fail. A sobering number that keeps making headlines. But why?

A major reason: Large language models like ChatGPT or Claude are often problematic for enterprise use. OpenAI recently switched off all legacy model variants when releasing GPT-5 – a nightmare for any corporate IT with running processes. Add data privacy concerns, unpredictable behavior, and dependency on American cloud services to the mix.

Small but Mighty: The Advantages of SLMs

Small Language Models (typically 1-20 billion parameters) offer tangible benefits:

⚡ Speed: Responses in milliseconds instead of multi-second waits. Once you’ve experienced the responsiveness of a local SLM, there’s no going back.

🔒 Privacy: Runs on your own servers, needs no internet connection, no data leaves your premises. Ideal for sensitive corporate data.

🎯 Control: No surprise model updates, no sudden behavioral changes. The model does exactly what it’s supposed to do.

💰 Cost: Significantly cheaper to operate than API calls to major providers.

🔧 Customizability: Through fine-tuning, SLMs can be precisely trained for specific tasks – with manageable effort.

The Secret Sauce: LoRA Fine-tuning

The game-changer is called LoRA (Low-Rank Adaptation). This technique makes it possible to customize models with surprisingly little data (from ~100 examples) and compute power. The principle: You only train a small “adapter” that’s layered over the model weights – no retraining of the entire model required.
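
To see why the adapter is so small, it helps to count parameters. For a weight matrix W with shape d_out × d_in, LoRA trains two thin matrices B (d_out × r) and A (r × d_in) and uses W + scale · B·A. The sketch below uses made-up dimensions (2048 × 2048, rank 8), not the actual layer sizes of any specific model.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (parameters in the full matrix, parameters in the LoRA adapter)."""
    full = d_out * d_in
    adapter = rank * (d_in + d_out)  # B is d_out x rank, A is rank x d_in
    return full, adapter

def apply_lora(W, B, A, scale=1.0):
    """Return W + scale * (B @ A) using plain Python lists."""
    rank = len(A)
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]

full, adapter = lora_param_counts(2048, 2048, 8)
share = adapter / full * 100  # adapter size as a percentage of the full matrix

# Tiny rank-1 example: the base weights stay untouched, only B and A are trained.
merged = apply_lora([[1, 0], [0, 1]], B=[[1], [2]], A=[[3, 4]])
```

With these example dimensions the adapter holds 32,768 trainable values against 4,194,304 in the full matrix, which is under 1%.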

The result? A model that not only gives the right answers but responds in exactly the right style. Anyone who’s tried to get ChatGPT to give shorter answers or avoid certain formatting through prompting alone knows how difficult that is. With fine-tuning, it works reliably.

Live Demo: Our Own SLMs

In the webinar, we showed three fine-tuned models, all based on LiquidAI’s LFM-2 with just 1.2 billion parameters:

  1. General German Model: Solid answers to everyday and technical questions
  2. Fritz Perls Therapy Bot: A model that perfectly imitates the confrontational conversation style of Gestalt therapist Fritz Perls
  3. Market Research Association Model: Analyzes implicit brand associations in professional market research style

The responsiveness is impressive – answers come practically instantly. And the best part: Everything runs on our own European servers.

The Future: Hybrid is King

Our vision at HybridAI: It’s all about the combination. Small, fine-tuned models for routine tasks, large models for complex queries – orchestrated by an intelligent control layer that recognizes which model is right for each situation.

This gives enterprises the best of both worlds: Fast, controllable, privacy-compliant answers for 80% of queries – and the power of large models when truly needed.
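
What such a control layer decides can be shown with a deliberately naive heuristic. The marker words and the length threshold below are assumptions for illustration; a production router would more likely use a small classifier than keyword matching.

```python
import re

# Hypothetical signals that a query needs a large model.
COMPLEX_MARKERS = {"compare", "analyze", "scenario", "why", "forecast"}

def route(query, max_small_words=25):
    """Return 'small' for routine queries and 'large' for long or analytical ones."""
    words = re.findall(r"[a-z]+", query.lower())
    if len(words) > max_small_words or COMPLEX_MARKERS & set(words):
        return "large"
    return "small"

routine = route("What are your opening hours?")
analytical = route("Compare our cost structure with last year")
```

Routing mistakes are cheap in one direction only: sending a simple query to the large model costs money and latency, while sending a complex one to the small model costs answer quality, so the threshold is worth tuning against an evaluation set.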

Want to Try It Yourself?

We’re making our SLM demo publicly available. Test for yourself how the small models perform – and contact us if you’d like to discuss custom fine-tuned models for your use cases.

🚀 HybridAI + N8N: Your AI Agent Just Got Seriously Agentic 🚀

Today marks a huge milestone for our HybridAI platform: we’ve fully integrated N8N – and it’s a game changer for anyone working with automation and intelligent agents.

What’s new?

🔗 Deep integration with N8N workflows
Every HybridAI user now gets free access to our dedicated N8N server. Even better: from inside any N8N workflow, you can now send a Function Call directly to your chatbot or agent – with a single click.

Example:
“Send a follow-up email to all leads from today.”
→ Your bot instantly triggers the corresponding N8N workflow.

Why does it matter?

Agentic AI means that your bot doesn’t just talk, it takes action. It can now handle complex workflows, launch services, update databases, and more – autonomously.

To achieve this, you need two things:

  1. A smart control center → your HybridAI agent
  2. A powerful action engine → N8N

Now you get both, perfectly connected.

What is N8N, anyway?

N8N is a no-code automation tool developed in Berlin. With it, you can:

  • Connect APIs and AI models
  • Read/write Google Docs
  • Send emails
  • Query or update databases
  • Build custom nodes for anything else

And now, your HybridAI chatbot can trigger it all seamlessly from any conversation.

How to get started?

If you have a HybridAI account, just go to your “AI Functions & Actions” section in the admin area and create a Function Call pointing to your N8N webhook. That’s it – your bot is ready to act.
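
Under the hood, a Function Call to a webhook is just an HTTP request. The sketch below builds such a request with Python's standard library; the URL is a placeholder and the JSON payload shape is an assumption, since the exact format depends on how you configure the Function Call and your N8N workflow.

```python
import json
import urllib.request

def build_webhook_request(webhook_url, function_name, arguments):
    """Build (but do not send) a JSON POST request for a webhook."""
    payload = json.dumps({"function": function_name, "arguments": arguments}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_webhook_request(
    "https://n8n.example.com/webhook/follow-up-leads",  # placeholder URL
    "send_follow_up_email",                             # hypothetical function name
    {"lead_filter": "today"},
)
# urllib.request.urlopen(request) would send it; left out here because the URL is a placeholder.
```

On the N8N side, a workflow that starts with a Webhook trigger receives this payload and can branch on the function name and arguments.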


🎯 Try it now and explore new levels of automation with HybridAI + N8N.

New IoT Integration: Real-World Data Meets Conversational Intelligence (2026 Update)

Update 2026: As of now, we also support MQTT sensor data. More importantly, we have connected our IoT sensor infrastructure to our BI solution. This means that data can now not only be read and reported, but also analyzed and evaluated in a multi-dimensional way.

We’re excited to introduce a powerful new feature on our platform: the ability to stream IoT sensor data directly into your chatbot’s context window. This isn’t about triggering an external API tool call—it’s about augmenting the bot’s real-time understanding of the world.

How it works

IoT sensors—whether connected via MQTT, HTTP, or other protocols—can now send live data to our system. These values are not fetched on-demand via function calls. Instead, they’re continuously injected into the active context window of your agent, making the data instantly available for reasoning and conversation.
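
The mechanism can be sketched in a few lines. The class below is a simplified stand-in, not our platform code: `on_message` plays the role of an MQTT or HTTP handler, and `render` produces the text block that would be placed into the agent's context on each turn. Topic names, the block header and the staleness limit are assumptions.

```python
import time

class SensorContext:
    """Keeps the latest reading per topic and renders it as a context block."""

    def __init__(self, max_age_seconds=60.0):
        self.max_age = max_age_seconds
        self._latest = {}  # topic -> (value, unit, timestamp)

    def on_message(self, topic, value, unit, now=None):
        """Store a reading; newer readings overwrite older ones for the same topic."""
        self._latest[topic] = (value, unit, time.time() if now is None else now)

    def render(self, now=None):
        """Return the current readings as text, skipping stale values."""
        now = time.time() if now is None else now
        lines = ["[LIVE SENSOR DATA]"]
        for topic in sorted(self._latest):
            value, unit, ts = self._latest[topic]
            if now - ts <= self.max_age:
                lines.append(f"{topic}: {value} {unit} ({now - ts:.0f}s ago)")
        return "\n".join(lines)

sensors = SensorContext()
sensors.on_message("car/battery", 23, "%", now=100.0)
sensors.on_message("scale/weight", 81.4, "kg", now=0.0)  # too old by the time we render
block = sensors.render(now=110.0)
```

Dropping stale readings matters here: a value that sits in the context looks current to the model, so an outdated one is worse than a missing one.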

Real-World Use Cases

🏃‍♂️ Fitness and Weight Loss

A health coach bot can respond based on your real-time activity:

“You’ve already reached 82% of your 10,000 step goal—great job! Want to plan a short walk tonight?”

Or reflect weight trends from smart scales:

“Your weight dropped by 0.8 kg since last week—awesome progress! Should we review your meals today?”

⚡️ E-Mobility and Charging

A mobility assistant knows your car’s charging state:

“Your battery is at 23%. The nearest fast charger is 2.4 km away—shall I guide you there?”

Bots can also keep track of live station availability and recommend based on up-to-date infrastructure status.

🏗 Accessibility and Public Infrastructure

A public-facing city bot could say:

“The elevator at platform 5 is currently out of service. I recommend using platform 6 and taking the overpass. Need directions?”

Perfect for people in wheelchairs or with limited mobility.

🏭 Smart Manufacturing and Industry

A factory assistant can act on process data:

“Flow rate on line 2 is below target. Should I trigger the maintenance routine for the filter system?”

This allows for natural language monitoring, error detection, and escalation—all in real time.

What Makes This Different?

🔍 Contextual Awareness, Not Tool-Calling
Sensor data is part of the active reasoning window—not fetched via a slow external call, but immediately available to the model during inference.

🤖 True Multimodal Awareness
Bots now reason not just over language but also over live numerical signals—physical reality meets LLM intelligence.

🚀 Plug & Play Integration
Bring your own sensors: from wearables to factory machines to public infrastructure. We help you connect them.

In Summary

This new feature unlocks unprecedented potential for intelligent agents—combining the power of conversational AI with a live, evolving understanding of the physical world. Whether you’re building a wellness coach, a mobility assistant, or an industrial controller, your agent can now think with real-world data in real time.

Reach out if you’d like to get started!