AI in Accounting: What Actually Works in 2026

Everyone’s talking about AI in accounting now. The consultants have their slides ready. The vendors are rebranding their invoice OCR tools. LinkedIn is full of posts about “How I automated my finance tasks – comment FINANCE to get my n8n workflow.” Wrappers everywhere.

Here’s my honest take after working on the technical AI side of this stuff for a while: AI will absolutely change accounting. But not in the way most people think, and probably not as fast as the headlines suggest.

The terminology mess

Let’s start with the basics, because the language around this is a disaster. When people say “AI in accounting,” they could mean any of the following:

Rule-based automation: If invoice amount > 10,000, route to CFO. If “Intra-community supply under § 4 1b”, then “VAT_00”. Not AI. Just code with a marketing budget.
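To make that concrete: here is roughly what such a "rule engine" looks like in practice. A minimal sketch, with field names invented for illustration:

```python
# Not AI: the kind of "rule" hiding behind many AI labels.
# Field names are invented for illustration.
def route_invoice(invoice: dict) -> str:
    if invoice["amount"] > 10_000:
        return "ROUTE_TO_CFO"
    if "intra-community supply" in invoice["text"].lower():
        return "VAT_00"
    return "STANDARD_PROCESSING"
```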

Machine learning classifiers: Trained models that categorize transactions based on patterns. These are actually useful and have been around for years. But they often don’t generalize well and are hard to keep up to date, not least because they tend to be black boxes.

OCR and document extraction: Reading invoices and pulling out vendor names, amounts, dates. This is common practice now. And in some instances it actually works.

Large Language Models: Our favorite brand new toy. GPT, Claude, Gemini. Can understand context, interpret messy inputs, handle edge cases. But also: can hallucinate numbers with absolute confidence.

Most “AI accounting” products today are really just ML classifiers with an LLM-powered chatbot stapled on top. Which is fine, but let’s be honest about what we’re dealing with.

Where AI already works today

Behind all the marketing talk, there are real wins happening right now:

Document capture and extraction. Modern systems can read invoices in any format, any language, any level of scan quality. The combination of vision models and LLMs has basically solved this problem. You still need human review for edge cases, but 80-90% straight-through processing is achievable. And if you get an invoice in a new format, it still works. Because AI.

Transaction categorization. For standard cases, ML models are excellent at learning your chart of accounts and applying it consistently. They don’t get tired on Friday afternoons. They don’t have “creative” interpretations of cost centers. Also: don’t forget that AI doesn’t necessarily mean LLMs. We also have an awesome family of sub-1B-parameter encoder-only models that are wonderful at classification.
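For illustration, here is a minimal sketch using the Hugging Face pipeline API – the model name is a placeholder for a checkpoint fine-tuned on your own historical bookings, not a real release:

```python
from transformers import pipeline

# Sketch: transaction categorization with a small encoder-only model.
# "your-org/booking-classifier" is a placeholder, not a real checkpoint.
classify = pipeline("text-classification", model="your-org/booking-classifier")

print(classify("AWS EMEA - cloud hosting, monthly invoice"))
# e.g. [{'label': '6815_IT_SERVICES', 'score': 0.97}]
```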

Anomaly detection. Spotting duplicate invoices, unusual amounts, vendors that suddenly changed bank details. Pattern recognition at scale is exactly what ML does well. This is genuinely useful for fraud prevention and audit prep.
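A sketch of two cheap checks of this kind, assuming a plain accounts-payable export (the column names are mine, not a fixed schema):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

invoices = pd.read_csv("ap_invoices.csv")  # vendor_id, invoice_no, amount, iban

# 1. Exact duplicates: same vendor, same invoice number, same amount.
duplicates = invoices[invoices.duplicated(
    subset=["vendor_id", "invoice_no", "amount"], keep=False)]

# 2. Unusual amounts (-1 = anomaly). Real systems would use richer
# per-vendor features, but the principle is the same.
model = IsolationForest(contamination=0.01, random_state=42)
invoices["flag"] = model.fit_predict(invoices[["amount"]])
```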

Natural language queries. “Show me all marketing expenses over 50k last quarter” without writing SQL. This works now. It’s not magic, but it looks like magic and really saves time. Why not chat with your business data 😉
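Under the hood this is usually an LLM writing SQL that the database then executes – the model never computes the numbers itself. A minimal sketch, assuming an OpenAI-style API and an invented schema:

```python
from openai import OpenAI

client = OpenAI()
SCHEMA = "expenses(id, category, amount, booked_at, cost_center)"  # illustrative

def question_to_sql(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"Translate the question into SQL for {SCHEMA}. Reply with SQL only."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(question_to_sql("Show me all marketing expenses over 50k last quarter"))
# Expected shape: SELECT ... FROM expenses WHERE category = 'marketing' AND amount > 50000 ...
```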

The common thread? These are all tasks where being approximately right most of the time is valuable, and where humans can easily verify the output.

Where things get interesting (and dangerous)

Now for the hard part.

The moment you need AI to make a decision that has legal or tax implications, everything changes. Consider VAT determination on an incoming invoice. Sounds simple: it’s 19%, right?

Except when it’s not. Is the supplier in another EU country? Is this a service or a good? Does reverse charge apply? Is it construction-related (§13b in Germany)? Is the supplier even VAT-registered? Is it a triangular trade? Is there a pandemic with special VAT rates?

I’ve written about this specific problem with tax codes in SAP before. The short version: there are dozens of edge cases, and getting it wrong means audit findings, back taxes, and possibly fraud allegations.

Here’s the uncomfortable truth: LLMs are very good at explaining what reverse charge is. In academic voice or as a sonnet. But they’re dangerously unreliable at determining whether a specific invoice should use it. The difference matters.

The hallucination problem is real. An LLM will confidently tell you that this invoice clearly qualifies for intra-community supply treatment. It might even cite the relevant EU directive. It might also be completely wrong, because it didn’t notice the supplier has a German VAT ID, or because the goods never actually left the country. I ran a couple of examples through different LLMs – and they were very opinionated about certain things. But not necessarily right. So right now, we’re creating a VATBench to get a better view of this.

When Claude or GPT makes a mistake in a creative writing task, you get a weird sentence. When it makes a mistake in tax determination, you get a six-figure assessment in your next audit.

The hybrid AI architecture that actually works

So where does this leave us? Not with “AI bad, humans good.” The answer is architectural. The pattern that actually works combines three things:

LLMs for interpretation. Let the language model read the invoice, extract the relevant facts, classify the transaction type, identify the supplier’s jurisdiction. This is what they’re good at – information extraction!

Structured rules for decisions. Tax law is not creative. It’s a decision tree with many branches but clear logic. Once you have the facts, applying the rules should be deterministic. No creativity needed. No hallucination possible.

Transparent audit trails. Every decision needs to document why it was made. Which invoice fields were extracted. How the supplier was classified. Which rule determined the tax code. When the auditor asks, you need answers.

The key insight: don’t ask the LLM what the tax code should be. Ask it to extract the facts, then apply your rules. It’s not half as sexy as “our AI automatically handles everything.” But it works.
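In code, the split looks roughly like this. A minimal sketch, assuming an OpenAI-style API – the model choice and field names are mine, and the rule set is grossly simplified:

```python
import json
from openai import OpenAI

client = OpenAI()
EU = {"AT", "BE", "FR", "IT", "NL"}  # abbreviated; the real list has 27 members

def extract_facts(invoice_text: str) -> dict:
    # Step 1: the LLM only extracts facts. It never picks the tax code itself.
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content":
            "Extract as JSON: supplier_country, supplier_vat_id, buyer_country, "
            "supply_type ('goods' or 'service'), net_amount.\n\n" + invoice_text}],
    )
    return json.loads(response.choices[0].message.content)

def determine_tax_code(facts: dict) -> str:
    # Step 2: deterministic rules. Grossly simplified - real VAT logic has far
    # more branches (reverse charge, §13b, triangular trade, ...).
    if facts["supplier_country"] == "DE" and facts["buyer_country"] == "DE":
        return "VAT_19"
    if facts["supplier_country"] in EU and facts["supply_type"] == "goods":
        return "VAT_00_INTRA_COMMUNITY"
    return "NEEDS_HUMAN_REVIEW"  # when in doubt, escalate - and log why
```

The point is the division of labor: the model fills in the facts, and the rules (plus the audit trail around them) make the call.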

What this means for CFO offices and finance teams

A few practical conclusions:

You’re not getting replaced. The “AI will automate away accounting” takes are mostly written by people who’ve never closed a month-end.

Your job is changing. Less data entry, more oversight. Less manual matching, more exception handling. Less typing, more thinking. If you’re spending 60% of your time on tasks that could be automated, you should definitely be talking about AI.

You need to understand the tools. Not how to build an LLM from scratch (though that is super fun to do). But how they work, where they fail, what they can and can’t do. The finance leaders who thrive will be the ones who can evaluate AI vendors with real technical understanding.

Start with contained problems. Don’t try to “AI-enable the entire finance function.” Pick one painful process with clear success criteria. Invoice capture. Expense categorization. Intercompany matching. Get that working, learn from it, then expand.

The bottom line on AI in accounting

AI in accounting is real, useful, and overhyped all at the same time. The technology works for information extraction, pattern matching, and natural language interfaces. It doesn’t work – not safely – for unsupervised decision-making on anything with legal consequences.

The winning approach combines the interpretive power of LLMs with the precision of rule-based systems and the oversight of human experts. It’s less exciting than “fully autonomous AI accounting” but it’s what actually ships, actually works, and actually survives audits.

An Evaluation Set for Every Customer

Today we launched a new feature in the Prompt-Tuning-Clinic: the “Evaluation Criteria” section.

It’s one of the most annoying things in AI: finding out whether a custom-configured AI (chatbot, agent, automation) is doing well or not. In most cases, both suppliers and customers treat it like this:

“Yesterday I ran this prompt against it, and it looked really good – great progress!” – or – “My boss asked it to do X and it gave a totally wrong answer, we have to redo the whole thing!”

To some extent this is an inherent problem of AI: partly because of the universal capability of these systems – you can ask practically anything and will always get an answer – and partly because their non-deterministic architecture makes it very hard to pin down what they can and cannot do.

We were a bit tired of this, so we thought: why are we reading LMArena (by the way, we recently launched a German LLM-Arena – try it here) and other rankings of new AI models every other day, yet don’t apply similar mechanisms to our customer installations?

This is exactly what this new feature brings:

  • Define a set of test prompts (you can upload reference material, like your API documentation or a Markdown export of your website, and let the AI propose test prompts)
  • Run these prompts against the current configuration of the bot
  • Evaluate the answers (this can also be done automatically with an LLM)
  • Define correct answers for edge cases
  • Permanently save the prompts that matter
  • Give answers a thumbs up/down to create cases for fine-tuning and DSPy
  • Run them all to get a quality ranking (a minimal sketch of this loop follows below)
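Conceptually, the loop behind this is simple. Here is a Python sketch of the idea – run_bot() and judge() are placeholders for your chatbot endpoint and an LLM grader, not our actual implementation:

```python
# Conceptual sketch of the evaluation loop, not our actual implementation.
def run_bot(prompt: str, config: str) -> str: ...    # placeholder: call your bot
def judge(answer: str, expected: str) -> float: ...  # placeholder: exact match or LLM grader

test_set = [
    {"prompt": "What are your API rate limits?", "expected": "1,000 requests per minute"},
    {"prompt": "Do you support SSO?", "expected": "Yes, via SAML and OIDC"},
]

def evaluate(config_version: str) -> float:
    scores = [
        judge(run_bot(case["prompt"], config_version), case["expected"])
        for case in test_set
    ]
    return sum(scores) / len(scores)  # quality score between 0 and 1

# Run before and after every configuration or model change:
# print(evaluate("v1"), evaluate("v2"))
```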

Once this is set up, the game changes drastically, because now we (both supplier and customer) have a well-defined test set of intended behavior that can be run automatically.

This is not only useful for the initial setup of a system, but also for improvements, model updates, new settings, and so on.

And since we also offer fine-tuning for our models and have integrated DSPy as an automated prompt-tuning tool, you can create training data for both while building your evaluation set – a simple thumbs up/down on an answer creates an entry in the test database for later.

Sign up for a free account and try it out!

Business Intelligence in the AI Era in 2026: Opportunities, Risks, and the Architecture Behind It

Let’s be honest: Does your company have all business-relevant information available at the push of a button? Or is it also stuck in various data silos, largely unconnected – the ERP here, the CRM there, plus Excel spreadsheets on personal drives and strategy documents somewhere in the cloud?

If you’re nodding right now, you’re in good company. I regularly speak with CEOs and finance leaders, and the picture is almost always the same: the data is all there. But bringing it together to answer a specific question takes days – if anyone can do it at all.

Why This Is Becoming a Problem Right Now

The days when companies could rely on stable markets and predictable developments are over. Inflation, geopolitical tensions, disrupted supply chains, a labor market in flux – all of this demands a new discipline: Decisions must not only be good, they must be good fast.

Traditional business intelligence has a proven answer to this: dashboards, KPIs, monthly reports. But let’s be honest – these tools hit their limits as soon as questions get more complex. What happens to our margin if we switch suppliers? How does a price increase affect different customer segments? What scenarios emerge if the euro keeps falling?

Questions like these need more than static charts. They need a real conversation with your own data.

The Temptation: An AI Sparring Partner for Your Decisions

This is exactly where generative AI gets really exciting. The idea is compelling: An intelligent assistant that knows your company’s numbers, understands connections, and lets you explore strategic options – anytime, without scheduling, without someone having to build an analysis first.

“How did our top 10 customers develop last quarter?” “What if we reduced the product portfolio by 20%?” “Compare our cost structure with last year and show me the biggest outliers.”

A dialogue like this would democratize business intelligence. Not just the controller with their Excel expertise would have access to insights – every decision-maker could query the data themselves. I still find this idea fascinating.

The Problem: When AI Hallucinates, It Gets Really Expensive

But – and this is a big but – here’s the crux. Large Language Models are impressive at generating plausible-sounding answers. They’re considerably less reliable at delivering factually correct answers. Especially when it comes to concrete numbers.

An AI that misremembers a date in a creative text? Annoying, but manageable. An AI that invents a revenue figure or miscalculates a margin during a business decision? That can really hurt. The danger multiplies because the answers are so damn convincing. We humans tend to trust a confidently delivered statement – even when it comes from a statistical language model.

I say this from experience: A naive integration of ChatGPT with company data is a risk, not progress. Anyone who sees it differently has either been lucky or hasn’t noticed yet.

The Technical Challenge: Connecting Three Worlds

The solution lies in a well-thought-out architecture that intelligently brings together three different data sources:

Structured data via SQL: The hard facts – revenues, costs, quantities, customer histories – typically reside in relational databases. Here, the AI must not guess but query precisely. The system must generate SQL queries, execute them, and correctly interpret the results. No room for creativity.

Unstructured data via RAG: Beyond the numbers, there’s context – strategy papers, market analyses, internal guidelines, meeting notes. These documents can be accessed through Retrieval Augmented Generation: The system searches for relevant text passages and provides them to the language model as context.

The model’s world knowledge: Finally, the LLM brings its own knowledge – about industries, economic relationships, best practices. This knowledge is valuable for interpretation, but dangerous when mixed with concrete company figures.

The art lies in cleanly separating these three sources and making transparent where each piece of information comes from.

The Solution: Everything into the Context Window

Modern LLMs offer context windows of 100,000 tokens and more. This opens up an elegant architectural approach: Instead of letting the model guess which data might be relevant, we proactively load all needed information into the context.

A well-designed system works in several steps: It analyzes the user’s question and identifies relevant data sources. Then it executes the necessary SQL queries. In parallel, it searches the document base via RAG. And finally, the LLM receives all this information served up together – with clear labeling of sources.

The language model thus becomes an interpreter and communicator, not a fact generator. It can explain numbers, reveal connections, ask follow-up questions, discuss options for action – but it doesn’t invent data, because the real data is already in the context.
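Put together, the orchestration might look something like this sketch – the four helpers are placeholders for your LLM API, database layer, and vector store:

```python
# Sketch of the orchestration described above. The four helpers are
# placeholders for your LLM API, database layer, and vector store.
def generate_sql(question: str) -> str: ...        # LLM writes the query
def query_db(sql: str) -> list: ...                # the database produces the numbers
def search_docs(q: str, top_k: int) -> list: ...   # RAG retrieval
def llm(system: str, user: str) -> str: ...        # final answering call

def answer(question: str) -> str:
    sql = generate_sql(question)                # 1. hard facts come from SQL ...
    rows = query_db(sql)                        #    ... executed, not remembered
    passages = search_docs(question, top_k=5)   # 2. context from documents
    context = (                                 # 3. everything labeled by source
        f"[DATABASE] query: {sql}\nresults: {rows}\n\n"
        f"[DOCUMENTS]\n{passages}\n\n"
        "Answer using only the data above. Cite [DATABASE] or [DOCUMENTS] for "
        "every figure; mark your own assessments as [MODEL OPINION]."
    )
    return llm(system=context, user=question)
```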

Transparency as a Design Principle

Such a system must build transparency into its DNA. Every statement about concrete numbers should cite its source. The user must be able to trace: Does this come from the database? Was it quoted from a document? Or is it an assessment by the model?

This transparency isn’t just a technical feature – it’s the prerequisite for trust. Anyone basing business decisions on AI-supported analyses must know what they’re relying on.

The Path Forward

Business intelligence with AI is neither utopia nor hype – it’s an architecture challenge. The technology is mature, the models are powerful, the interfaces exist. What many companies lack is a thoughtful approach that leverages the strengths of LLMs without falling prey to their weaknesses.

The future belongs to systems that intelligently connect structured databases, document knowledge, and language models – while always making transparent what is fact and what is interpretation. Companies that find this balance gain more than just another analytics tool. They gain a real sparring partner for better decisions in difficult times.

And yes – that’s exactly what we’re working on.