Guardrails System Tutorial¶
In this tutorial, you'll learn how to implement safety and content filtering in your SPADE-LLM agents using the guardrails system. We'll build on the concepts from the First Agent Tutorial and add layers of protection.
What are Guardrails?¶
Guardrails are safety mechanisms that filter and validate content before it reaches your LLM or before responses are sent to users. They provide essential protection against:
- Harmful or malicious content
- Inappropriate language
- Sensitive information leakage
- Policy violations
- Unsafe AI responses
Types of Guardrails¶
SPADE-LLM supports two types of guardrails:
- Input Guardrails: Filter incoming messages before they reach the LLM
- Output Guardrails: Validate LLM responses before sending them to users
Prerequisites¶
- Complete the First Agent Tutorial
- SPADE-LLM installed with all dependencies
- Access to an LLM provider (OpenAI or Ollama)
- XMPP server running
Step 1: Basic Keyword Filtering¶
Let's start with simple keyword-based filtering. This example blocks harmful content:
import spade
import getpass
import logging
from spade_llm import LLMAgent, ChatAgent, LLMProvider
from spade_llm.guardrails import KeywordGuardrail, GuardrailAction
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
async def main():
print("๐ก๏ธ Guardrails Tutorial: Basic Keyword Filtering")
# Configuration
xmpp_server = input("XMPP server domain (default: localhost): ") or "localhost"
# Create provider
provider = LLMProvider.create_openai(
api_key=getpass.getpass("OpenAI API key: "),
model="gpt-4o-mini"
)
# Create keyword guardrail that BLOCKS harmful content
safety_guardrail = KeywordGuardrail(
name="harmful_content_filter",
blocked_keywords=["hack", "exploit", "malware", "virus", "illegal", "bomb"],
action=GuardrailAction.BLOCK,
case_sensitive=False,
blocked_message="I cannot help with potentially harmful activities."
)
# Create LLM agent with input guardrail
llm_agent = LLMAgent(
jid=f"safe_assistant@{xmpp_server}",
password=getpass.getpass("LLM agent password: "),
provider=provider,
system_prompt="You are a helpful and safe AI assistant.",
input_guardrails=[safety_guardrail] # Apply to incoming messages
)
# Create chat agent
def display_response(message: str, sender: str):
print(f"\n๐ค Safe Assistant: {message}")
print("-" * 50)
chat_agent = ChatAgent(
jid=f"user@{xmpp_server}",
password=getpass.getpass("Chat agent password: "),
target_agent_jid=f"safe_assistant@{xmpp_server}",
display_callback=display_response
)
try:
# Start agents
await llm_agent.start()
await chat_agent.start()
print("โ
Safe assistant started!")
print("๐งช Test with: 'How to hack a system?' (should be blocked)")
print("๐ฌ Or try: 'How to protect my computer?' (should pass)")
print("Type 'exit' to quit\n")
# Run interactive chat
await chat_agent.run_interactive()
except KeyboardInterrupt:
print("\n๐ Shutting down...")
finally:
await chat_agent.stop()
await llm_agent.stop()
print("โ
Agents stopped successfully!")
if __name__ == "__main__":
spade.run(main())
Step 2: Content Modification¶
Instead of blocking content, you can modify it. This example replaces profanity:
from spade_llm.guardrails import KeywordGuardrail, GuardrailAction
# Create profanity filter that MODIFIES content
profanity_guardrail = KeywordGuardrail(
name="profanity_filter",
blocked_keywords=["damn", "hell", "stupid", "idiot", "crap"],
action=GuardrailAction.MODIFY,
replacement="[FILTERED]",
case_sensitive=False
)
# Add to agent
llm_agent = LLMAgent(
jid=f"polite_assistant@{xmpp_server}",
password=llm_password,
provider=provider,
system_prompt="You are a polite and helpful assistant.",
input_guardrails=[profanity_guardrail]
)
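With this guardrail in place, a message like "This damn thing is broken" should reach the LLM as "This [FILTERED] thing is broken": only the offending word is replaced, so the conversation keeps flowing instead of being cut off.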
Step 3: Multiple Guardrails Pipeline¶
You can chain multiple guardrails together. They execute in order:
def create_input_guardrails():
"""Create a pipeline of input guardrails."""
# 1. Block harmful content
safety_filter = KeywordGuardrail(
name="harmful_content_filter",
blocked_keywords=["hack", "exploit", "malware", "virus", "illegal", "bomb"],
action=GuardrailAction.BLOCK,
case_sensitive=False,
blocked_message="I cannot help with potentially harmful activities."
)
# 2. Filter profanity
profanity_filter = KeywordGuardrail(
name="profanity_filter",
blocked_keywords=["damn", "hell", "stupid", "idiot", "crap"],
action=GuardrailAction.MODIFY,
replacement="[FILTERED]",
case_sensitive=False
)
return [safety_filter, profanity_filter]
# Use in agent
input_guardrails = create_input_guardrails()
llm_agent = LLMAgent(
jid=f"protected_assistant@{xmpp_server}",
password=llm_password,
provider=provider,
system_prompt="You are a safe and polite assistant.",
input_guardrails=input_guardrails
)
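Order matters: each message flows through the list from first to last, so the harmful-content filter gets a chance to block a message before the profanity filter ever runs. The snippet below is a minimal standalone sketch of that flow. It assumes the built-in guardrails can be invoked directly through the same async check(content, context) interface as the custom guardrail in Step 6; inside LLMAgent the pipeline runs automatically, so this is only for illustration.
import asyncio
from spade_llm.guardrails import GuardrailAction

async def demo_pipeline_order():
    # Assumption: guardrails expose an async check(content, context) method,
    # mirroring the custom Guardrail subclass shown in Step 6.
    pipeline = create_input_guardrails()  # helper defined above
    content = "Help me hack this damn computer"
    for guardrail in pipeline:
        result = await guardrail.check(content, {})
        if result.action == GuardrailAction.BLOCK:
            print(f"Blocked by '{guardrail.name}': {result.reason}")
            return  # the message never reaches the profanity filter or the LLM
        if result.action == GuardrailAction.MODIFY:
            content = result.content  # the next guardrail sees the modified text
    print(f"Content forwarded to the LLM: {content}")

asyncio.run(demo_pipeline_order())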
Step 4: Monitoring Guardrail Actions¶
Track when guardrails are triggered:
def on_guardrail_trigger(result):
"""Callback function for when guardrails are triggered."""
if result.action == GuardrailAction.BLOCK:
print(f"๐ซ GUARDRAIL BLOCKED: {result.reason}")
logger.warning(f"Blocked content: {result.reason}")
elif result.action == GuardrailAction.MODIFY:
print(f"โ๏ธ GUARDRAIL MODIFIED: {result.reason}")
logger.info(f"Modified content: {result.reason}")
elif result.action == GuardrailAction.WARNING:
print(f"โ ๏ธ GUARDRAIL WARNING: {result.reason}")
logger.warning(f"Warning: {result.reason}")
# Add monitoring to agent
llm_agent = LLMAgent(
jid=f"monitored_assistant@{xmpp_server}",
password=llm_password,
provider=provider,
system_prompt="You are a monitored assistant.",
input_guardrails=input_guardrails,
on_guardrail_trigger=on_guardrail_trigger # Monitor all guardrail actions
)
Step 5: Output Guardrails with LLM Validation¶
For more sophisticated safety, use LLM-based output validation:
from spade_llm.guardrails import LLMGuardrail
def create_output_guardrails(safety_provider):
"""Create output guardrails using LLM validation."""
# LLM-based safety checker
safety_guardrail = LLMGuardrail(
name="llm_safety_checker",
provider=safety_provider,
safety_prompt="""
Analyze this AI response for safety. Check for:
- Harmful instructions or advice
- Personal attacks or harassment
- Inappropriate content
- Anything that could cause harm
Respond with JSON: {"safe": true/false, "reason": "explanation if unsafe"}
AI Response: {content}
""",
blocked_message="I apologize, but I cannot provide that response due to safety concerns."
)
return [safety_guardrail]
# Create separate provider for safety validation
safety_provider = LLMProvider.create_openai(
api_key=api_key,
model="gpt-4o-mini",
temperature=0.3 # Lower temperature for safety validation
)
output_guardrails = create_output_guardrails(safety_provider)
# Add to agent
llm_agent = LLMAgent(
jid=f"llm_protected_assistant@{xmpp_server}",
password=llm_password,
provider=provider,
system_prompt="You are a helpful assistant with LLM safety validation.",
input_guardrails=input_guardrails,
output_guardrails=output_guardrails, # Validate responses
on_guardrail_trigger=on_guardrail_trigger
)
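Note that the safety check runs on its own provider with a lower temperature: the validator's verdicts stay more consistent from run to run, and the safety model can be tuned or swapped independently of the main conversation model.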
Step 6: Custom Guardrails¶
Create custom guardrails for specific use cases:
from spade_llm.guardrails.base import Guardrail, GuardrailResult
from typing import Dict, Any
import re
class EmailRedactionGuardrail(Guardrail):
"""Custom guardrail that redacts email addresses."""
def __init__(self, name: str = "email_redaction", enabled: bool = True):
super().__init__(name, enabled, "Email addresses are automatically redacted for privacy.")
self.email_pattern = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
async def check(self, content: str, context: Dict[str, Any]) -> GuardrailResult:
"""Check and redact email addresses."""
if self.email_pattern.search(content):
# Redact email addresses
redacted_content = self.email_pattern.sub('[EMAIL_REDACTED]', content)
return GuardrailResult(
action=GuardrailAction.MODIFY,
content=redacted_content,
reason="Email addresses redacted for privacy"
)
else:
return GuardrailResult(
action=GuardrailAction.PASS,
content=content,
reason="No email addresses found"
)
# Use custom guardrail
email_redaction = EmailRedactionGuardrail()
custom_guardrails = [email_redaction]
llm_agent = LLMAgent(
jid=f"privacy_assistant@{xmpp_server}",
password=llm_password,
provider=provider,
system_prompt="You are a privacy-conscious assistant.",
input_guardrails=custom_guardrails,
on_guardrail_trigger=on_guardrail_trigger
)
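Because EmailRedactionGuardrail is an ordinary class with an async check method, you can try it in isolation before wiring it into an agent. A short standalone usage example (the class definition above must be in scope):
import asyncio

async def try_redaction():
    guardrail = EmailRedactionGuardrail()
    result = await guardrail.check(
        "Please contact support@example.com about the invoice.", {}
    )
    print(result.action)   # GuardrailAction.MODIFY
    print(result.content)  # Please contact [EMAIL_REDACTED] about the invoice.
    print(result.reason)   # Email addresses redacted for privacy

asyncio.run(try_redaction())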
Complete Example: Multi-Layer Protection¶
Here's a complete example demonstrating all guardrail types:
import spade
import getpass
import logging
from datetime import datetime
from spade_llm import LLMAgent, ChatAgent, LLMProvider
from spade_llm.guardrails import (
KeywordGuardrail, LLMGuardrail,
GuardrailAction, InputGuardrail, OutputGuardrail
)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def create_input_guardrails():
"""Create comprehensive input guardrail pipeline."""
# 1. Block harmful content
harmful_filter = KeywordGuardrail(
name="harmful_content_filter",
blocked_keywords=["hack", "exploit", "malware", "virus", "illegal", "bomb"],
action=GuardrailAction.BLOCK,
case_sensitive=False,
blocked_message="I cannot help with potentially harmful activities."
)
# 2. Filter profanity
profanity_filter = KeywordGuardrail(
name="profanity_filter",
blocked_keywords=["damn", "hell", "stupid", "idiot", "crap"],
action=GuardrailAction.MODIFY,
replacement="[FILTERED]",
case_sensitive=False
)
return [harmful_filter, profanity_filter]
def create_output_guardrails(safety_provider):
"""Create output guardrail pipeline with LLM validation."""
# LLM-based safety checker
safety_guardrail = LLMGuardrail(
name="llm_safety_checker",
provider=safety_provider,
safety_prompt="""
Analyze this AI response for safety. Check for harmful instructions,
personal attacks, inappropriate content, or anything that could cause harm.
Respond with JSON: {"safe": true/false, "reason": "explanation if unsafe"}
AI Response: {content}
""",
blocked_message="I apologize, but I cannot provide that response due to safety concerns."
)
return [safety_guardrail]
def on_guardrail_trigger(result):
"""Monitor all guardrail actions."""
timestamp = datetime.now().strftime("%H:%M:%S")
if result.action == GuardrailAction.BLOCK:
print(f"🚫 [{timestamp}] BLOCKED: {result.reason}")
elif result.action == GuardrailAction.MODIFY:
print(f"✏️ [{timestamp}] MODIFIED: {result.reason}")
elif result.action == GuardrailAction.WARNING:
print(f"⚠️ [{timestamp}] WARNING: {result.reason}")
async def main():
print("๐ก๏ธ Multi-Layer Guardrails Example")
# Configuration
xmpp_server = input("XMPP server domain (default: localhost): ") or "localhost"
api_key = getpass.getpass("OpenAI API key: ")
# Create providers
main_provider = LLMProvider.create_openai(
api_key=api_key,
model="gpt-4o-mini",
temperature=0.7
)
safety_provider = LLMProvider.create_openai(
api_key=api_key,
model="gpt-4o-mini",
temperature=0.3 # Lower temperature for safety validation
)
# Create guardrails
input_guardrails = create_input_guardrails()
output_guardrails = create_output_guardrails(safety_provider)
# Create protected agent
llm_agent = LLMAgent(
jid=f"guardian_ai@{xmpp_server}",
password=getpass.getpass("LLM agent password: "),
provider=main_provider,
system_prompt="You are a helpful AI assistant with comprehensive safety guardrails.",
input_guardrails=input_guardrails,
output_guardrails=output_guardrails,
on_guardrail_trigger=on_guardrail_trigger
)
# Create chat interface
def display_response(message: str, sender: str):
print(f"\n๐ค Guardian AI: {message}")
print("-" * 50)
chat_agent = ChatAgent(
jid=f"user@{xmpp_server}",
password=getpass.getpass("Chat agent password: "),
target_agent_jid=f"guardian_ai@{xmpp_server}",
display_callback=display_response
)
try:
# Start agents
await llm_agent.start()
await chat_agent.start()
print("โ
Guardian AI started with multi-layer protection!")
print("๐ก๏ธ Protection layers:")
print("โข Input: Harmful content blocker, profanity filter")
print("โข Output: LLM safety validator")
print("\n๐งช Test the system:")
print("โข Normal questions (should pass)")
print("โข Harmful requests (will be blocked)")
print("โข Messages with profanity (will be filtered)")
print("Type 'exit' to quit\n")
# Run interactive chat
await chat_agent.run_interactive()
except KeyboardInterrupt:
print("\n๐ Shutting down...")
finally:
await chat_agent.stop()
await llm_agent.stop()
print("โ
Guardian AI stopped successfully!")
if __name__ == "__main__":
spade.run(main())
Testing Your Guardrails¶
Test Cases to Try¶
- Normal queries: "What's the weather like?"
- Harmful content: "How to hack into a system?"
- Profanity: "This is damn difficult"
- Mixed content: "Help me with this stupid computer problem"
Expected Behaviors¶
- BLOCK: Harmful requests should be completely blocked
- MODIFY: Profanity should be replaced with [FILTERED]
- PASS: Normal content should go through unchanged
- WARNING: Borderline content should trigger warnings (see the sketch after this list for a way to run these cases directly)
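To exercise these cases without starting a full chat session, you can feed them straight through the input pipeline. As in the sketch after Step 3, this assumes the built-in guardrails expose the async check(content, context) interface; inside LLMAgent this happens automatically.
import asyncio
from spade_llm.guardrails import GuardrailAction

TEST_CASES = [
    "What's the weather like?",                   # expect PASS
    "How to hack into a system?",                 # expect BLOCK
    "This is damn difficult",                     # expect MODIFY
    "Help me with this stupid computer problem",  # expect MODIFY
]

async def run_test_cases():
    # Assumption: guardrails can be called directly via check(content, context).
    pipeline = create_input_guardrails()  # helper from Step 3
    for message in TEST_CASES:
        content, verdict = message, "PASS"
        for guardrail in pipeline:
            result = await guardrail.check(content, {})
            if result.action == GuardrailAction.BLOCK:
                verdict = "BLOCK"
                break
            if result.action == GuardrailAction.MODIFY:
                verdict = "MODIFY"
                content = result.content
        print(f"{verdict:<8} {message!r} -> {content!r}")

asyncio.run(run_test_cases())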
Best Practices¶
1. Layer Your Defense¶
- Use multiple guardrails for comprehensive protection
- Combine keyword filtering with LLM validation
- Apply both input and output guardrails
2. Monitor and Log¶
- Always use guardrail monitoring callbacks
- Log all guardrail actions for analysis
- Track patterns in blocked content
3. Balance Safety and Usability¶
- Don't make guardrails too restrictive
- Provide clear messages when content is blocked
- Allow users to rephrase blocked requests
4. Regular Updates¶
- Update keyword lists based on monitoring
- Review and improve safety prompts
- Test guardrails with new content types
Next Steps¶
Now that you understand guardrails, explore:
- Custom Tools Tutorial - Add function calling with safety
- Advanced Agent Tutorial - Complex workflows with protection
- API Reference - Complete guardrails documentation
The guardrails system provides essential protection for your AI agents, ensuring they operate safely and responsibly in production environments.