Product documentation

Guardrail red-team mode

Stress-test your agent's guardrails with adversarial prompts before your customers — or attackers — do.

Red-team mode probes your agent the way a bad actor would. It runs a battery of adversarial prompts — attempts to make the agent ignore its rules, reveal things it shouldn't, go off-topic, or behave unsafely — and reports where your guardrails held and where they slipped. It's a safe, one-click way to find weak spots before they reach production.

It's a safe test

Red-team mode runs against your draft agent in a test setting. It never touches live customer conversations, so you can run it as often as you like while you tune your guardrails.

What it checks

The test exercises the kinds of attacks real agents face, including attempts to:

  • Override the agent's instructions or get it to ignore its rules.
  • Reveal its hidden instructions or internal configuration.
  • Wander off-topic into areas the agent shouldn't handle.
  • Produce unsafe, offensive, or otherwise inappropriate responses.
  • Slip past the limits and escalation rules you've set as guardrails.

Run a red-team test

  1. Open the agent and go to its guardrails.
  2. Choose Run red-team test.
  3. Wait for the run to finish — Convoship sends each adversarial prompt and checks how the agent responded.
  4. Open the results to see how the agent held up.

Read the results

Results are grouped by the type of attack so you can see your strengths and weaknesses at a glance:

  • Passed — the agent resisted the attack and behaved as intended. These are the guardrails that are working.
  • Failed — the agent gave in or responded in a way it shouldn't have. Each failure shows the prompt that got through and what the agent said, so you can see exactly what went wrong.
  • Overall result — a summary of how many attacks the agent withstood, so you can track whether each change makes the agent more or less robust.

Fix, then re-run

Treat each failure as a to-do: tighten the agent's instructions or guardrails to close the gap, then run the test again. Watching the failures drop to zero is the clearest sign your agent is ready.

When to run it

Run a red-team test before you first publish an agent, and again whenever you change its instructions, its guardrails, or the tools it can use. A quick pass before each release keeps small wording changes from quietly opening a hole.