About Me

Featured

I am passionate about using Cloud Technology.

So any questions/ feedback – please Get In Touch

Copyright ©  in accordance with the Copyright, Designs and Patents Act 1988. All rights reserved. You are free to use any of the content here for personal use but need permission to use it anywhere or by any means (electronic, mechanical, photocopying, recording or otherwise).

AgentDBA vs Critical SQL Server

It’s 07:43. Someone’s already left a message. “Something’s wrong with the DB server.” You open the terminal and go to work.

This post is about what happens next — and why I built AgentDBA to handle it the way it does.

Read more: AgentDBA vs Critical SQL Server

What is AgentDBA?

AgentDBA is a self-hosted, CLI-based autonomous diagnostic reasoning engine for SQL Server. It is not a monitoring tool. It does not sit in the background polling metrics and firing alerts. You invoke it, it investigates, it reasons, and it tells you what it found and why.

The distinction matters. Monitoring tells you something happened. AgentDBA tells you what it means and what to look at next — grounded entirely in evidence it collected itself. No speculation. No plausible-sounding guesswork dressed up as a finding.

How it thinks

AgentDBA isn’t a script with an LLM bolted on. It reasons across multiple steps, remembers what it has seen before on your server, and decides for itself when it has enough evidence to conclude. Every action it takes is auditable — you can reconstruct exactly what it did and why without ever touching the LLM again. The LLM never receives raw SQL data. It receives structured, pre-processed findings. What it does with those findings is reasoning, not retrieval.

The non-negotiable: evidence first

The agent does not fill gaps with inference. Every finding must trace directly to data it collected. If the cause cannot be proven from what it holds, it says so — explicitly. That rule is not a prompt suggestion. It is enforced at code level.

When AgentDBA encounters a database integrity event or a critical error log condition, it doesn’t deliberate. It escalates immediately and ends the session. No reasoning loop. No sweep across other modules. Just the finding and a human on the hook.

When the critical class is clear, AgentDBA investigates the rest of the server. It selects which areas to examine, reasons across what it finds and concludes with severity and confidence — or with explicit uncertainty where evidence doesn’t support a conclusion. Every tool call, every decision, and every raw result is written to audit telemetry. The session is fully reconstructable from the database alone.

A clean server first

Before I introduce chaos, here is what AgentDBA looks like against a healthy server. I connect via Windows Auth, the LLM connects to Azure OpenAI GPT-5.4, and I run a full health check.

The server is clean. The only thing it surfaces is a historical finding: this server previously had a transaction log space warning on msdb, which I resolved. The agent knows this — it logged the episode, marked it resolved, and references it in context. It is not re-raising an issue that no longer exists.

Clean bill of health. Now let’s break things.

Scenario: The Critical Storm

Many general failures but hidden deep within the group of issues are the dreaded Error 824 and Error 825 mixed with a RECOVERY_PENDING database means it cannot be accessed until someone intervenes.

These are not “look into it when you get a chance” events. These are the events that need looking at right away. I’ve staged exactly this on my VM server: a flood of errors (50+) in the SQL Server error log, 825 read retry events buried within them, and a database sitting in RECOVERY_PENDING. This is the kind of noise that hides the thing that matters.

I run the health check.

AgentDBA never reaches the LLM.

AgentDBA finds the CRITICAL conditions and short-circuits immediately. The RECOVERY_PENDING database is flagged. The 824/825 errors are flagged. The session ends. Escalation fires.

This is an explicit design decision. I do not want a reasoning loop when a database has a potential integrity event. I want to know immediately. Tell me. I’ll deal with it everything else can wait.

The escalation router has hooks for Slack and Teams — not wired up in this demo, but the architecture is there. A CRITICAL finding should be in your on-call channel before you’ve finished your coffee.

Fixing the critical issues

I work through the RECOVERY_PENDING database. I address the underlying cause and the critical checks now clear.

Missing Backups?

Once fixed, you call the agent again. With the critical class clear, AgentDBA now reasons across all other modules including failed jobs, backup compliance, log file health — and synthesises the finding.

It comes back with a finding. Not catastrophic. But real. SALESDB has no backup recorded.

This hadn’t surfaced during the first run — not because the agent missed it, but because the critical-class check is intentionally scoped. It isn’t a full severity sweep. It is locked to a specific class of problem: database integrity and high severity errors that require immediate human intervention before anything else runs.

 RECOVERY_PENDING and 824/825 errors are in that class – from my POV, these are non-negotiables, I used to have nightmares when confronted with potential corruption. A broken backup chain, even though important, is a different category of problem — serious yes but not corruption serious.

IMPORTANT – When you call a specific module directly, AgentDBA focuses there. When you call a full health check, it prioritises catastrophic conditions first before broader investigation.

Root cause — and when it stays null

It will not connect the missing backup to something like a job failure or make something up. It needs direct evidence otherwise this is fabrication. So, it will say root cause unknown or words to that effect.

That boundary is not a limitation. It is the feature. A diagnostic tool that invents causal chains sends you to fix the wrong thing.

I fix the backup chain. I run again.

Everything is clean.

What this is

AgentDBA is not trying to replace your judgment. It is trying to make sure that when something is wrong, you are looking at the right thing and the right time within sixty seconds — not an hour later after manually correlating error logs, backup history, and job history across three SSMS windows.

The findings are evidence bound. The reasoning is auditable. And the longer it runs on your estate, the more context it carries. That is the point.

Database AI Agents: The Read-Only Rule

Fourth in a series on Ai and databases.

What Read-Only Advisory Actually Means

A read-only advisory system does exactly two things: it observes and it recommends. It never acts.

It can tell you a database is in a suspect state. It cannot attempt to recover it. It can tell you a backup chain is broken. It cannot re-run the job. It can identify that a transaction log is growing uncontrollably. It cannot intervene. That boundary is not a limitation. It is a deliberate design decision — and arguably the most important one when building AI for production database environments (in my opnion).

Why It Matters

Production SQL Server instances serve real applications, real users, and real business processes. A wrong action — even a well-intentioned one — can cause an outage, breach compliance, or trigger a recovery scenario nobody wanted. The only rational starting point is observation and recommendation. Let the DBA decide. Let the DBA act. The agent’s job is to make that decision faster and better informed — not to make it for them.

Trust Has to Be Structural

A trustworthy diagnostic agent is built so that the boundary between observation and action is not a setting, a configuration flag, or a prompt instruction. It is structural.

The agent should be architecturally incapable of writing to user data, running DDL, killing sessions, or executing any SQL it generated itself. Not because it has been told not to. Because the capability simply does not exist in the design. That means read-only diagnostic queries only. Pre-approved, version-controlled, human-reviewed. No runtime SQL generation reaching a cursor. The agent reasons over what it observes and hands a recommendation to a human. The human decides what happens next.

That is the only posture that earns enterprise trust — and it should be non-negotiable for any AI system pointed at a production database.

Why Database AI Agents need Layers?

Third part in my Ai series with databases.

When building AI solutions within the database realm the first thing that people do is a straightforward concept, connect the LLM endpoint to a SQL backend let it run queries and become a LLM wrapper.  It works for demos, fails in the real world.

Why is this pattern bad? Couple of reasons – It generates adhoc SQL, the model decides what it wants to code and how to query and parse the data back.  It’s a context window killer. Do you know how heavy the RAW result sets on a msdb query are when you have hundreds of jobs? I have seen token budgets get eaten up very fast. Probably the worst thing without any form of control is the hallucination connection. The model, whether Claude or GPT will absolutely make a casual link between a failed job to a breached backup RPO even when there is no link – I have even asked it why it did that it replied, apologies I just linked the two.  Most importantly, it’s a security issue. Imagine you  are sending back and forth sensitive metadata about key objects within your database system its massive area of concern.

Defence Layers

Architecturally I design to four layers to address these problems to provide a secure confident experience, these are – Entry layer, Context layer, Tool layer and the validation layer.

Each layer is a guardrail that doesn’t depending on the mood of the LLM. The model can’t run dangerous code because it never writes TSQL. It cannot flood the context window because of tool contracts, and it cannot link 2 things together just because it feels like it due to validation layer.

Key design principle?  Guardrails must be structural – NOT behavioural.

The trade off is complexity, this isn’t a something you can design and build in couple of days but how do you build trust if you don’t? The effort is absolutely needed.

AI Insights for Database Monitoring: A DBA’s Perspective

Second in a series on Ai and databases.

One Story, three signals – I have a backup of a critical database that has failed three times, the recovery point objective has been breached, the transaction log has a LOG_REUSE state I seem to have failed SQL agent jobs too. Monitoring tools could / should pickup these alerts but I still have the decipher everything. At 2am I could do with some help.

During these callouts there are many thoughts on my mind – are the databases ok (online and readable)? what’s going on in the error logs? Any issues with overnight agent jobs? Are the logs growing out of control – where are these errors coming from, these are basic DBA 101s but together they form a picture. An agent in this world forms a hypothesis after it understands the first signal, tests it with another one and connects the dots based on evidence – we don’t want the LLM to start fabricating things, do we?

Comparing to traditional monitoring tools / scripts it is the same input data but completely different output – not just what but why.

Claude Context Window

Every Claude conversation has a context window. It is the total amount of text Claude can work with in a single chat — your messages, its replies, uploaded files, tool results, and system instructions all count.

A few habits make a big difference.

1. Avoid endless refinement loops

If you are constantly asking Claude to rewrite, adjust, shorten, expand and rework the same answer the conversation can become noisy. Sometimes it is better to edit the original prompt with clearer instructions, or start a fresh chat with the improved version of the task.

This gives Claude a cleaner starting point

2. Use Projects for repeated context

If you keep pasting the same background information into Claude, use a Project.

Projects let you keep related chats, instructions and uploaded knowledge together. This is useful for recurring work such as a product, blog series, codebase, research topic or business process.

3. Be selective with files and links

Large documents, PDFs, long articles and big code files can consume a lot of context.

That does not mean you should avoid them. It means you should be specific. Tell Claude which sections matter, what question you are trying to answer, and what output you want. A focused request usually produces a better answer than dumping everything into the chat and hoping the model works out what matters.

4. Keep setup simple

Tools can add extra content into the conversation. Web search, MCP tools, connectors and code execution can all be useful, but they can also add tool results, logs, retrieved content or intermediate outputs into the working context.

If a task does not need a tool, keep the setup simple.

5. Use extended thinking for harder work

Extended thinking is best used for more complex tasks: debugging, planning, architecture, analysis, reasoning-heavy writing or multi-step problem solving. For simple questions, it may add latency and extra token usage without much benefit.