How I AI - Automated AI Interactive Review

TL;DR: I’ve been using AI to interact with a web application and find defects. It does interact with the application, keeps focus on the specific story context, and then finds and raises defects.

The experiment is: how much value can I get from a single goal-oriented prompt in Codex which spins up sub-agents to review a story and its associated PR, try various interactions on the deployed site, and report defects?

Is that testing? I’m not ready to call it testing yet, but I use a lot of the words associated with testing to describe it. At the moment I think of it as an automated interactive review.

Video Overview

In the video I provide an overview of the experiment, where to find the prompt and examples of the types of artifacts that it creates.

Artifacts:

readable logs from agents
defect reports
defect replication videos
summary reports
issue screenshots

What is this?

I’ve seen a lot of people experimenting with Agentic technology. Letting the AI loose on the application for coding, review, raising issues, prioritising issues, fixing issues, etc.

A good example of Agentic technology in the Testing process can be found here:

I’m not quite ready to do that yet.

I’m trying to look at how AI can fit into a normal team process with multiple team members and existing development tools.

Because those are the types of teams I work and consult with.

As such, I’m building a process around Github:

Github issues to represent Stories, Defects, Tasks
Planning using Github Project
Github PR for reviews prior to accepting code
CI pipeline using Github Actions
Deployed Test Environment (using Github Pages)
a live environment

You could replace Github with Jira or other tools, but the concept here is: humans use their normal tools and processes for planning and communication, and the AI uses those same tools to communicate back to humans.

I’ve been using AI Coding Tools to create code. I wanted to see if I could use AI tooling to identify defects through interacting with the application. And then… could it communicate those defects using the human-oriented process?

I’ve been surprised at how successful this has been.

I am documenting all the execution sessions and the evolution of the goal prompt and tooling over at:

https://github.com/eviltester/anywaydata-ai-testing

The main experiment is:

ai-review-goal.md

And session execution results in:

/docs/testing

Process

Create Github Issues to track work
- have AI work on the Github issues using branches for the stories
When ‘done’ create a PR to review the work
- I also have various review tools scanning the PR but that requires a different blog post.
Trigger the experimental Goal to review the story, review the PR, interact with the application on the test environment using Playwright, and report back findings relevant to the issue and PR.

The prompt is a little larger than this, but that’s the gist.

The process requires that the AI coding tooling can access Github.

I’m using Codex so it has Github capabilities built in, but it also uses HTTP to access issue JSON feeds, and it uses local Git and Github tooling for pulling code and writing comments on Github.
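To illustrate the "HTTP to access issue JSON feeds" part, here is a minimal sketch of reading a story from the public GitHub REST API and pulling out the fields a review might need. The `StoryContext` type and the function names are hypothetical, chosen for illustration; they are not taken from the goal prompt, and the actual tooling may fetch issues differently.

```python
import json
import urllib.request
from dataclasses import dataclass


@dataclass
class StoryContext:
    """Hypothetical container for the parts of an issue a review uses."""
    number: int
    title: str
    body: str
    labels: list[str]


def issue_api_url(owner: str, repo: str, number: int) -> str:
    """Build the GitHub REST API URL for a single issue."""
    return f"https://api.github.com/repos/{owner}/{repo}/issues/{number}"


def parse_issue(payload: str) -> StoryContext:
    """Extract title, body and label names from the issue JSON."""
    data = json.loads(payload)
    return StoryContext(
        number=data["number"],
        title=data["title"],
        body=data.get("body") or "",  # body is null for empty issues
        labels=[label["name"] for label in data.get("labels", [])],
    )


def fetch_issue(owner: str, repo: str, number: int) -> StoryContext:
    """Fetch a public issue over HTTP (unauthenticated network call)."""
    request = urllib.request.Request(
        issue_api_url(owner, repo, number),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_issue(response.read().decode("utf-8"))
```

Calling `fetch_issue(owner, repo, number)` with your own repository details would return the story text to feed into the review. Unauthenticated calls are rate limited, so a real setup would probably send a token.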

I’ve also set up a variety of MCPs. Playwright MCP and Chrome Dev Tools MCP allow interaction with the application using a browser.

Playwright, Node, etc. are all installed on my machine, so it also writes ad hoc scripts to automate the application and record screenshots and videos as it works.

I currently run this on Codex using 5.5 on High reasoning in fast mode - so if I was using an API plan to do this it might be quite expensive.

At the end of running the session the AI tooling finishes by:

collating reports and logs into PDF
reporting the results by creating a Github issue in the target folder
creating sub-issues for any defects found
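The reporting step above could be sketched roughly as follows: each defect becomes a JSON payload suitable for the GitHub REST "create an issue" endpoint (`POST /repos/{owner}/{repo}/issues`), with a reference back to the session summary issue. The `Defect` fields, the body layout and the `defect` label are assumptions for illustration, not the format the goal prompt actually produces.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Defect:
    """Hypothetical shape of a defect found during a review session."""
    title: str
    steps: list[str]
    expected: str
    actual: str
    evidence: list[str] = field(default_factory=list)  # e.g. screenshot names


def defect_issue_payload(defect: Defect, summary_issue: int) -> dict:
    """Build the title/body/labels payload for a new defect issue."""
    lines = [
        f"Found during the review session reported in #{summary_issue}.",
        "",
        "## Steps to replicate",
    ]
    lines += [f"{i}. {step}" for i, step in enumerate(defect.steps, start=1)]
    lines += ["", "## Expected", defect.expected]
    lines += ["", "## Actual", defect.actual]
    if defect.evidence:
        lines += ["", "## Evidence"]
        lines += [f"- {name}" for name in defect.evidence]
    # "defect" is an assumed label name; use whatever the project defines.
    return {"title": defect.title, "body": "\n".join(lines), "labels": ["defect"]}


if __name__ == "__main__":
    example = Defect(
        title="Export button does nothing",
        steps=["Open the grid", "Click Export"],
        expected="A file downloads",
        actual="Nothing happens",
    )
    print(json.dumps(defect_issue_payload(example, summary_issue=1), indent=2))
```

Mentioning the summary issue as `#N` in the body creates a cross-reference on Github; linking it as a true sub-issue would need an extra API call that is not shown here.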

All reports are written to the anywaydata-ai-testing project.

Videos and support files are left on my local machine - because I’m trying not to clutter the Github repo with too much temporary information. But if this was working with Jira or a team environment then I’d either have the defect videos as attachments or find a way to share the videos across the team via some video sharing tool.

Hmmmm, idea: have a prompt that uses ffmpeg to create a video summarizing the execution session, including all defect reports, and upload it to YouTube or some other video hosting. This isn’t far-fetched because I have used ffmpeg to create videos of defect replication, and I do have other experiments for creating videos automatically.

I initially had the AI creating replication videos with ffmpeg, but then had it use the Playwright APIs, and it now uses the Playwright screencast API (which has automated chapter points to explain what is happening).

Results

Every time I’ve run this, it has found useful defects.

Some are defects that I might not have fixed in the past because they seem low priority, but with AI a fix is a prompt away, so it isn’t too much work to fix most items.

Some I might not have fixed because they are low priority design decisions, but AI seems biased in its exploration patterns and keeps identifying the same types of issues, so I fix them or re-design the system to make it easier to use and understand. That is something I might not have done for primarily human use, because humans can adapt more easily.

I do not let the defects flow directly into another AI session. I triage them on Github and close those that are unimportant. I also group them to identify which ones are actually related to ‘design’ and which are issues.

I don’t let the AI act on them directly because I don’t know what decisions it would make to ‘fix’ those reported issues which are design related. I did try this in the past but it made what I considered to be the ‘wrong’ design decision.

Each execution session takes about 40 minutes for the AI tooling to complete and has resulted in 4-6 issues being raised each session.

Human in the loop for decision making.

Logs

Each agent writes log files e.g.

These vary in style and consistency, and some are slightly malformed for viewing as Markdown in Github, but they are intended for human consumption.

The AI tooling is prompted to write timestamps, intent, actions and results.
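As an illustration of what one such entry could look like, here is a minimal sketch that renders a timestamp, intent, actions and result as Markdown. The layout and the `format_log_entry` name are assumptions; the real agent logs vary in style and are written by the AI tooling, not by a fixed formatter like this.

```python
from datetime import datetime, timezone
from typing import Optional


def format_log_entry(intent: str, actions: list[str], result: str,
                     when: Optional[datetime] = None) -> str:
    """Render one log entry as Markdown.

    `when` should be timezone-aware; it defaults to the current UTC time.
    """
    moment = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    lines = [
        f"### {moment.strftime('%Y-%m-%d %H:%M:%S')} UTC",
        "",
        f"**Intent:** {intent}",
        "",
        "**Actions:**",
        "",
    ]
    # Number the actions so a human can refer to a specific step.
    lines += [f"{i}. {action}" for i, action in enumerate(actions, start=1)]
    lines += ["", f"**Result:** {result}", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_log_entry(
        intent="Check that export works for an empty grid",
        actions=["Open the test environment", "Click Export"],
        result="No file was downloaded",
    ))
```

Keeping every entry in the same shape makes the logs easier to sample, which matters when a human is only reading some of them.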

I tend to sample these reports. Some are more interesting than others.

I find the logs most useful when they have tables or sections of data because I can often spot issues in the data and results that the AI doesn’t identify as an issue.

Evolution

The goal prompt I’m using evolves after each execution because I see something that didn’t work well enough, or I want a different output.

When I have an idea of what I want it to do, I experiment with prompts before adding it to the goal. For example, when I wanted videos of defect replication, I interacted with the AI tooling to find different approaches:

a simple ‘create a video of the defect’ prompt worked
but then I wanted captions and explanation
it first used ffmpeg to edit an existing video, and it worked but took more time, felt risky for flakiness and I thought there should be a better way
then tried Playwright with the CDP API and that created good videos with the intent clearly visible but seemed complex
then found Playwright screencast API which creates slides with a delay through a simple supported API and so settled on this. The output was not as good as the direct CDP API control, but it seemed simpler and more reliable.

I’ve also added an agents.md to try to prevent the AI from committing support material and videos. It tends to bypass guardrails like .gitignore, so I’ve also added pre-commit hooks.
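A pre-commit hook for this kind of guardrail might look something like the sketch below. It lists the staged files with `git diff --cached --name-only` and rejects the commit if any look like videos or support material. The blocked extensions and directory names are assumptions for illustration; the hooks in the actual repo may check different things.

```python
import subprocess
import sys
from pathlib import PurePosixPath

# Assumed patterns: adjust to whatever counts as temporary material.
BLOCKED_SUFFIXES = {".webm", ".mp4", ".mov"}
BLOCKED_DIRS = {"videos", "support"}


def blocked_paths(paths: list[str]) -> list[str]:
    """Return the staged paths that should not be committed."""
    blocked = []
    for path in paths:
        parsed = PurePosixPath(path)
        in_blocked_dir = bool(BLOCKED_DIRS & set(parsed.parts[:-1]))
        if parsed.suffix.lower() in BLOCKED_SUFFIXES or in_blocked_dir:
            blocked.append(path)
    return blocked


def staged_paths() -> list[str]:
    """List files added, copied or modified in the index."""
    completed = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in completed.stdout.splitlines() if line]


def main() -> int:
    """Return 1 (reject the commit) if any blocked file is staged."""
    blocked = blocked_paths(staged_paths())
    for path in blocked:
        print(f"Refusing to commit support material: {path}", file=sys.stderr)
    return 1 if blocked else 0

# As a hook, save this as .git/hooks/pre-commit (executable, with a Python
# shebang line) and finish the file with: sys.exit(main())
```

Unlike .gitignore, a hook runs on every commit attempt, including files that were force-added, so it catches more of the cases where the tooling works around an ignore rule.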

Comparison to commercial tooling

I haven’t used a lot of commercial AI testing tooling. The evaluations are too small or limited, the tools are in beta and don’t work effectively on some environments, or they don’t fit the style of approach I want to use.

Of the tools I’ve tried, I’ve had one or two points of action, but not as much value as I’ve had from this goal.

Commercial tooling offers a nicer UI, but I can’t imagine it is using as heavyweight an AI model so it has to have a better surrounding set of co-ordination activities, prompts and supporting tools to get the most out of this.

The tools I tried did not hook into Github and a normal human process… or if they did I didn’t see this in the limited evaluation time or it wasn’t accessible during an evaluation.

Most commercial tools seem to be described in terms of ‘testing’ and ‘self-healing’, i.e. they identify paths through the application and repeat those paths, ‘self-healing’ where the paths fail due to changes, then they might add a few more ‘tests’ for new functionality in the app.

I use AI for all of the above, but I create normal test execution coverage using Playwright, Selenium, Testing Library, API calls, Domain/Payload/Page objects, and maintain this as a normal part of development.

It is pretty easy to keep your automated execution coverage up to date and running clean if it is updated in sync with the application updates.

What I really want from the additional tooling is information. Insights related to the context of the work that is not covered by the automated execution coverage.

I hope to evaluate more commercial tools in the future, because they may well have a more optimized process than using a prompt.

But, I need to know the capabilities of the existing AI Coding tools prior to adopting or recommending any commercial tool so that I know the commercial offering has benefits.

Try it Yourself

Since this is just a /goal prompt it can be used on any AI Coding or CLI system.

They all have their own nuances, which hopefully you’ll know how to adapt the prompt for.

I use Codex, so this is built around a goal, which is the current Codex syntax that is most ‘loop’-like.

If your AI Tooling can:

Use Playwright MCP
Use Chrome Dev Tools MCP
Make HTTP calls
Interact with Git and Github

Then you’ll be able to use this concept or adapt the prompt to your needs.

I’m making all my results and experimentation public here:

https://github.com/eviltester/anywaydata-ai-testing

While I build and develop AnyWayData:

AI Interactive Web Application Review