# AI System for Software Testing and QA
Code coverage of 80% looks good until you examine what is actually covered: happy paths and obvious cases, but not boundary conditions, not integrations between components, not edge cases with unexpected data. The AI QA system addresses not the problem of "no tests exist" but of "tests exist, yet they don't catch what matters."
## Components of AI Testing System
```
[Code Analysis]              [Requirement Analysis]
 AST parsing                  NLP from Jira/Confluence
       ↓                             ↓
          [Test Generation Engine]
          Unit | Integration | E2E | API
                     ↓
           [Test Prioritization]
  Change Impact Analysis → run necessary tests, not all
                     ↓
            [Result Analysis]
  Failure Classification + Root Cause Suggestion
                     ↓
          [Coverage Intelligence]
          Semantic gaps in coverage
```
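The Test Prioritization stage can be sketched with a minimal change-impact heuristic: run only the tests whose imports touch the changed modules. The function names and the import-based dependency rule below are illustrative assumptions, not the article's actual implementation.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Top-level module names imported by a test file."""
    mods = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            mods.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module.split(".")[0])
    return mods


def select_tests(changed_modules: set[str], tests: dict[str, str]) -> list[str]:
    """Return only the tests that import at least one changed module."""
    return sorted(
        name for name, src in tests.items()
        if imported_modules(src) & changed_modules
    )
```

A real system would also track transitive dependencies; this direct-import version shows the idea in a few lines.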
## AI Coverage Analysis: Finding Semantic Gaps
Traditional coverage tools (Istanbul, JaCoCo) count executed lines. The problem: 100% line coverage does not mean all business scenarios are tested.
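A tiny illustration of the gap, using a hypothetical function invented for this example:

```python
# One happy-path test executes every line of this function, so line
# coverage reports 100%, yet the risky business scenarios (pct > 100,
# negative pct) are never exercised.
def apply_discount(price: float, pct: float) -> float:
    discounted = price * (1 - pct / 100)
    return round(discounted, 2)


assert apply_discount(100, 10) == 90.0  # happy path: 100% line coverage
# Scenarios line coverage is silent about:
#   apply_discount(100, 150) → -50.0  (negative total: likely a bug)
#   apply_discount(100, -10) → 110.0  (a "discount" that raises the price)
```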
````python
from langchain_openai import ChatOpenAI
import ast
import json


class SemanticCoverageAnalyzer:
    """Analyzes semantic gaps in test coverage"""

    ANALYSIS_PROMPT = """Analyze the function and existing tests.
Identify which business scenarios and boundary conditions are NOT covered.

Function:
```python
{function_code}
```

Existing tests:
{existing_tests}

Identify uncovered scenarios:
- Boundary values (empty string, None, 0, max int, negative)
- Parameter combinations
- Error scenarios (exceptions, invalid input)
- Concurrent access (if applicable)
- Business rules in conditions

For each: describe scenario + why it matters + possible bug if not tested.
Return JSON: {{"gaps": [{{"scenario": ..., "importance": ..., "potential_bug": ...}}]}}"""

    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0.1)

    def analyze_function_coverage(
        self,
        function_source: str,
        test_source: str
    ) -> list[dict]:
        result = self.llm.invoke(
            self.ANALYSIS_PROMPT.format(
                function_code=function_source,
                existing_tests=test_source
            )
        )
        return json.loads(result.content)["gaps"]

    def extract_functions_from_module(self, source: str) -> list[dict]:
        """Extracts functions from a Python module via AST"""
        tree = ast.parse(source)
        functions = []
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                func_source = ast.get_source_segment(source, node)
                complexity = self._calculate_cyclomatic_complexity(node)
                functions.append({
                    "name": node.name,
                    "source": func_source,
                    "complexity": complexity,
                    "line_start": node.lineno
                })
        # Most complex functions first: highest testing priority
        return sorted(functions, key=lambda x: x["complexity"], reverse=True)

    def _calculate_cyclomatic_complexity(self, node) -> int:
        """Cyclomatic complexity — test prioritization metric"""
        complexity = 1
        for child in ast.walk(node):
            if isinstance(child, (ast.If, ast.While, ast.For, ast.ExceptHandler,
                                  ast.With, ast.Assert)):
                complexity += 1
            elif isinstance(child, ast.BoolOp):
                # "a and b and c" adds n-1 decision points
                complexity += len(child.values) - 1
        return complexity
````
### Test Generator with Mutation Testing
```python
class AITestGenerator:
    UNIT_TEST_PROMPT = """Generate pytest unit tests for the function.

Function:
{function_code}

Uncovered scenarios (focus on these):
{gaps}

Requirements:
- Use pytest + pytest-mock
- Parametrize via @pytest.mark.parametrize where applicable
- For each test: Arrange-Act-Assert
- Boundary value tests
- Invalid input tests
- Mocks for external dependencies

Return only code, no explanations."""

    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0.1)

    async def generate_unit_tests(
        self,
        function_source: str,
        gaps: list[dict]
    ) -> str:
        gaps_text = "\n".join([
            f"- {g['scenario']}: {g['importance']}"
            for g in gaps[:5]  # top 5 by importance
        ])
        result = await self.llm.ainvoke(
            self.UNIT_TEST_PROMPT.format(
                function_code=function_source,
                gaps=gaps_text
            )
        )
        return result.content

    async def run_mutation_testing(self, source_file: str, test_file: str) -> dict:
        """Runs mutation testing via mutmut"""
        import subprocess
        result = subprocess.run(
            ["mutmut", "run", f"--paths-to-mutate={source_file}",
             f"--tests-dir={test_file}"],
            capture_output=True, text=True
        )
        # Analyze survived mutants (tests didn't catch the change);
        # _parse_survived_mutants and _generate_for_mutants are elided here
        survived = self._parse_survived_mutants(result.stdout)
        if survived:
            additional_tests = await self._generate_for_mutants(survived, source_file)
            return {"survived_count": len(survived),
                    "additional_tests": additional_tests}
        return {"survived_count": 0, "mutation_score": "100%"}
```
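The Result Analysis stage from the component diagram (failure classification plus root-cause suggestion) can be approximated offline with heuristic rules before any LLM is involved. The regex patterns, category names, and hints below are illustrative assumptions, not the article's implementation:

```python
import re

# (pattern, class, root-cause hint); first match wins
RULES = [
    (r"AssertionError", "assertion",
     "Expected/actual mismatch: likely a real regression or a stale test."),
    (r"(ConnectionError|TimeoutError|ECONNREFUSED)", "infrastructure",
     "Environment issue: retry before blaming the code."),
    (r"(fixture .* not found|ImportError|ModuleNotFoundError)", "test_setup",
     "Broken test harness, not the product."),
]


def classify_failure(traceback_text: str) -> dict:
    """Classify a pytest failure from its traceback text."""
    for pattern, kind, hint in RULES:
        if re.search(pattern, traceback_text):
            return {"class": kind, "root_cause_hint": hint}
    return {"class": "unknown", "root_cause_hint": "Needs manual triage."}
```

In the full system, failures classified as `infrastructure` would be retried automatically, and only `assertion` failures escalated with an LLM-generated root-cause summary.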
## CI/CD Integration
```yaml
# .github/workflows/ai-qa.yml
name: AI QA Analysis

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-test-analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # needed for diff

      - name: Analyze changed files
        run: |
          git diff origin/main...HEAD --name-only --diff-filter=AM | \
            grep "\.py$" > changed_files.txt

      - name: Run AI coverage analysis
        run: |
          python qa_system/analyze_coverage.py \
            --changed-files changed_files.txt \
            --generate-missing-tests \
            --output coverage_report.json

      - name: Comment PR with AI findings
        uses: actions/github-script@v7
        with:
          script: |
            const report = require('./coverage_report.json')
            const comment = formatReport(report)  // formatReport: project helper
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              body: comment
            })
```
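The workflow invokes `qa_system/analyze_coverage.py`, which is not shown above. A possible skeleton, matching the flag names from the workflow (the report shape and the wiring to the analyzer are assumptions):

```python
import argparse
import json
from pathlib import Path


def build_report(changed_files: list[str]) -> dict:
    """Filter to Python files; the real version would run the analyzer on each."""
    py_files = [f for f in changed_files if f.endswith(".py")]
    return {"analyzed_files": py_files, "gaps": [], "generated_tests": []}


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--changed-files", required=True)
    parser.add_argument("--generate-missing-tests", action="store_true")
    parser.add_argument("--output", default="coverage_report.json")
    args = parser.parse_args()

    changed = Path(args.changed_files).read_text().splitlines()
    report = build_report(changed)  # would call SemanticCoverageAnalyzer here
    Path(args.output).write_text(json.dumps(report, indent=2))

# Entry point when run as a script:
# if __name__ == "__main__":
#     main()
```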
Case study: a Python backend service (FastAPI), 45,000 lines of code, 380 tests, 74% coverage. AI analysis identified 89 semantic gaps (scenario-level, not line-level), 34 of them high-priority, and generated 67 additional tests. When executed, 8 of the 67 failed, exposing real bugs in boundary-condition handling: None in aggregation, negative order quantities, sorting an empty list.
## Timeframe
- Coverage analysis + unit test generation: 3–4 weeks
- Full QA system with CI/CD integration: 8–10 weeks
- Mutation testing and E2E: +2–3 weeks