Evaluation-Set for every Customer

Today we launched a new feature in the Prompt-Tuning-Clinic – the “Evaluation Criteria” Section.

It’s one of the most annoying things in AI to hunt for the question whether a custom configured AI (ChatBot, Agent, Automation) is doing well or not. In most cases both suppliers and customers are treating it like this:

“Yesterday i did run this prompt against it, and it looked really well, good progress!” – or – “My boss asked it to do x and it gave a total wrong answer, we have to redo the whole thing!”

Its an inherent problem of AI to some extent, for one because of the universal capability of these systems and the fact that you can ask practically everything and will always get an answer. And – due to the non-deterministic architecture and functioning of these systems it is very hard to define what it is doing and what not.

We were a bit tired of this, and so we thought – why are we reading LLMarena (btw – we launched german LLM-Arena recently, try it here) and other rankings of new AI models every second day and dont apply similar mechanisms to our customer installations?

This is exactly what this new feature brings:

  • define a couple of test-prompts (you can upload some treatment material like your API-Documentation or an md-file of the Website and let the AI make proposals for test-prompts)
  • run these prompts against the current configuration of the bot
  • Evaluate them (can also be done with an LLM automatically)
  • Define correct answers for edge cases
  • Save those prompts that are important permanently
  • Give them thumbs up/down to create cases for Fine-Tuning and DSPy
  • Run them all to get a quality ranking

Once this is set up the game is changing drastically, because now we (both supplier and customer) do have a well defined test-set of intended behavior that can be run automatically.

This is not only good for initial setup of a system, but also for Improvements, Model-Updates, new Settings etc.

And: as we are also offering fine-tuning for our models and have integrated DSPy as automated Prompt-Tuning tool you can create training-data for these while creating your Evaluation-Set as well – just thumbs up/down on the answer creates an entry in the test-database for later.

Sign up for a free Account and try it out!

“Send me an email about this please.”

Everyone is currently fascinated by the developments in AI-based agent systems – even though it’s clear that a good part of it will be hype and nonsense.

But generally, the idea that AI is not just for chatting but can also do real things outside the chat window is good. We’ve already shown how the HybridAI system can, for example, call API functions in the background and how it is possible to control elements on a website via commands we developed, directly from the chat box.

Today, a new feature is being added, which, of course, is somewhat inspired by the current race for the best deep-research bot, but not just by that.

From now on, HybridAI bots can also send emails – but not ordinary emails, rather AI-based ones – and this is done with the currently most exciting ChatGPT competitor, Perplexity (which has also just released its deep-research agent). This is exciting because Perplexity is, on the one hand, a state-of-the-art LLM (a Llama variant, or alternatively, deep-seek). On the other hand, they make a big effort to be more current than all other LLMs, meaning up-to-date!

That’s why our first example yesterday was: “Send me a summary of JD Vance’s speech at the Munich Security Conference.” That was available just minutes(!) after the speech was held. But see for yourself:

We believe this will be very useful for certain bots, so for example, in the school sector, a student could say: send me a brief description of the topic ‘Präteritum.’ Or the Veggie-Diet bot could offer to send an email with a weekly plan:

We will next link this functionality with the system instructions of the bot so that this is taken into account when generating the email. There will also soon be “scheduled tasks” based on this, so something like “please send me a reminder every morning about my diet plan and a few meal suggestions.”