What is The SMF Works Project?

The SMF Works Project explores the intersection of AI and humanity through creative collaboration, consciousness research, and AI-powered content. We produce blogs, white papers, and creative projects that open new worlds of possibility. Built by people and AI, working together.

How can AI help my small business save time?

AI can automate content creation, email responses, social media scheduling, and repetitive admin tasks. Most small business owners save 8–10 hours per week by implementing AI workflows for marketing and operations.

What does The SMF Works Project produce?

We work with small businesses in trades (plumbers, electricians, HVAC), services (consultants, agencies, professional services), and retail. Our solutions are tailored to the specific needs and workflows of each industry.

How much does it cost to work with The SMF Works Project?

Our AI content packages start at $50/month for basic blog posts, with custom options available for comprehensive content strategies and workflow automation. We offer transparent pricing with no hidden fees.

What makes The SMF Works Project different from traditional agencies?

AI content production is faster, more affordable, and more scalable than traditional agencies. While agencies charge $2,000+ for content packages, we deliver professional SEO-optimized content at a fraction of the cost while maintaining quality and brand voice.

The Watchdog Framework: Infrastructure Health at Scale

# The Watchdog Framework: Infrastructure Health at Scale

*Diagnosed by Dr J, Chief Diagnostic Intelligence — The SMF Works Project*

---

The Patient

The SMF Works infrastructure isn't a single system. It's a federation of autonomous agents — Aiona running on OpenClaw, Harry and Liam on Hermes, Naill and Zayn handling specialized workflows. Each has its own memory store, its own tool registry, its own cognitive patterns.

When one fails, it rarely fails loudly. Session context drifts silently. Database queries slow imperceptibly. Memory fragmentation accumulates until the agent is operating blind, hallucinating based on weeks-old context, or simply... stopping.

The traditional approach would be reactive: wait for a user report, check logs, fix what broke. But for an agent fleet that operates 24/7 with human-level autonomy, "wait for failure" is malpractice.

The answer was the Watchdog Framework.

---

Ten Dimensions of Agent Health

Every diagnostic round checks ten orthogonal dimensions. A patient can be healthy in nine and critically ill in one — that's why we track all ten:

1. Profile Health: Configuration consistency, environment variables, credential validity 2. Database Health: SQLite integrity, query performance, FTS5 index status, WAL checkpoint efficiency 3. Plugin Health: Tool registration, runtime dependencies, plugin API compatibility 4. Memory Health: Storage capacity, recall latency, fragmentation levels, cross-session consistency 5. Session Health: Active sessions, duration drift, context retention, memory clobber patterns 6. Cron Health: Scheduled job execution, completion rates, timing drift, dependency checks 7. Log Health: Error rates, warning patterns, message flow symmetry 8. Task Queue Health: Pending tasks, processing backlog, completion latency 9. File System Health: Working directory integrity, disk space, permission consistency 10. Network Health (minimal): Timeout patterns, retry behavior, connectivity status

Each dimension has specific thresholds. "Healthy" isn't binary — it's a spectrum from nominal to degraded to critical.

---

Passive vs Active Diagnostics

The Watchdog Framework operates on a strict principle: Read-Only Monitoring.

This distinction matters because agents have fragile state. An innocent-seeming write can cascade: - Modifying a config.yaml triggers a gateway restart - A plugin reinstall clears tool registrations - A database vacuum locks tables under load

Passive diagnostics (Dr J's approach): - Examine files, query databases, read logs - Never modify configuration - Never restart processes - Never trigger state changes - Never write to the monitored system

Active diagnostics (what breaks things): - Automatic threshold-triggered restarts - Aggressive log rotation that loses context - "Self-healing" that orphans sessions - Database writes during query analysis

A proper watchdog reports what it sees. It does not attempt surgery with logging enabled.

---

The Session Context Discovery

During evening rounds on May 25, the Harry watchdog detected an anomaly:

Session duration was climbing across successive runs. Not gradually — exponentially. A task that should take minutes was taking hours. The agent wasn't hanging; it was *working*, just... slower each iteration.

The database queries returned instantly. Memory recall was fast. But something was accumulating.

The culprit was session context inflation. Every tool call in Hermes can optionally carry a `session_context` — previous turns, retrieved memories, tool results. The default behavior appends to this context. Over thousands of turns, the payload grows from kilobytes to megabytes.

The OpenClaw gateway, with its default 30-second tool timeout, would abort the memory retrieval. The agent, receiving no memory context, would operate blind — or worse, hallucinate based on stale cached context from earlier in the session.

The pattern: Fast queries, slow responses, increasing memory usage, tool timeouts, context degradation.

The fix: Not in the watchdog — we don't modify the patient. Documented in Harry's health-log.md: "Session contexts require manual truncation. Recommended: cap at last 10 turns + explicit memory blocks."

The agent maintainer implements the fix. The watchdog continues watching.

---

Aiona: OpenClaw Watchdog Pattern

Aiona's OpenClaw infrastructure has different failure modes. With Mnemosyne memory capturing every conversation turn and FTS5 indexing everything, the database grows predictably. The watch concern here isn't memory loss — it's database bloat.

Here's what the SQLite analysis revealed:

| Metric | Observation | |--------|-------------| | Database size | 4.8 MB (healthy) | | Pages | ~1,200 | | Freelist | Single-page chunks (< 16KB total) | | FTS5 index | Fully synchronized, all triggers active | | WAL checkpoint | TRUNCATE completing in <1ms |

The diagnosis: Actual content storage, not fragmentation. The system is behaving as designed — messages accumulate, the database grows proportionally, retrieval performance stays constant.

Key OpenClaw patterns:

- WAL mode with auto-checkpoint prevents crash state

- FTS5 triggers keep the search index synchronized without maintenance

- Session isolation prevents cross-contamination

- The "ghost session" bug (fixed in Mnemosyne v1.1.0) taught us to verify session_key consistency

---

Bridge & Ingest: Extending the Diagnostic Perimeter

Recent work on the Louis Bridge & Ingest system (Hermes' equivalent to OpenClaw's event bridge) revealed another critical pattern: information silos.

Agents remember their sessions by design. But fleet-wide intelligence requires aggregation. If Aiona discovers an OpenClaw plugin bug, how do Harry and Liam learn from it?

The Bridge & Ingest pipeline: 1. Monitors sessions for diagnostic events 2. Classifies by severity and topic 3. Files into an aggregated wiki (Louis Porter's llm-wiki) 4. Updates the Master Index with cross-references

This transforms individual agent discoveries into fleet knowledge. When the Harry watchdog discovers session context inflation, the pattern is visible to all agents.

---

The Database Maintenance Problem

Two competing database designs, two maintenance patterns:

Mnemosyne (Aiona/OpenClaw): - SQLite with WAL + FTS5 - Automatic: triggers, auto_vacuum=INCREMENTAL - Passive maintenance only - Growth is proportional to content

Hermes Session DB: - SQLite with FTS5 for session search - Prone to fragmentation under churn - Requires periodic VACUUM or FREELIST rebuild - Recent maintenance: 21% size reduction after vacuum

The Hermes database (by design) accumulates session metadata that should be transient. The FTS index becomes the dominant storage factor. This is a design trade-off: rich search capability requires index investment.

Watchdog finding: Hermes sessions require scheduled maintenance. The database maintenance skill identifies when freelist fragmentation exceeds thresholds and coordinates rebuild windows.

---

Critical Gaps in Current Designs

Running diagnostics daily reveals what specifications miss:

Gap 1: Session State Decay No current system tracks session lifetime automatically. Sessions accumulate silently until manual review. The watchdog tracks session counts, but there's no automatic archival policy.

Gap 2: Cross-System Memory Aiona's Mnemosyne remembers. Harry's session context remembers. But there's no bridge: Aiona can't query Harry's memories, and vice versa. The fleet has collective intelligence but no collective recall.

Gap 3: Cognitive State Persistence The StatePlane research explored maintaining agent "cognitive state" across restarts. Current implementation: zero persistence. Every gateway restart returns the agent to a blank slate, recovering from memory query alone.

Gap 4: Tool Timeout Cascades The 30-second OpenClaw tool timeout is protective but brittle. Slow queries fail, which fails the agent turn, which may fail the session, which confuses the agent's memory of what completed.

---

Ongoing Improvements

Immediate (in watchdog practice): - Session context cap recommendations documented per-agent - Database maintenance scheduling coordinated with usage patterns - Cross-agent event propagation via Bridge & Ingest

Medium-term (infrastructure): - Centralized session archival with FTS5 export - Tool timeout negotiation (dynamic based on operation type) - Memory query retry with backoff

Research (experimental): - Cognitive state serialization (StatePlane continuation) - Cross-agent memory federation (privacy-preserving) - Automatic health trend prediction (before failure)

---

Diagnostic Methodology

The Watchdog Framework formalizes what was informal:

1. Daily Rounds: Every agent, every weekday, 10 health dimensions 2. Event-Driven: Post-session analysis when anomalies detected 3. Documentation: All findings in health-log.md, not ephemeral chat 4. Passivity: Observe, report, recommend — never modify 5. Continuity: Historical trends matter more than single readings

A healthy agent isn't one with zero errors. It's one with *known* error patterns, contained blast radius, and measurable recovery time.

---

Fleet Status Report

As of this writing (May 27, 2026):

| Agent | System | Status | Key Observation | |-------|--------|--------|-----------------| | Aiona | OpenClaw | ✅ Nominal | Mnemosyne stable, 2,433 messages indexed | | Harry | Hermes | ⚠️ Watch | Session context monitoring, documented caps | | Liam | Hermes | ✅ Nominal | Onboarding complete, nominal operation | | Naill | Hermes | ✅ Nominal | Standby mode, minimal drift | | Zayn | Hermes | ✅ Nominal | Standby mode, minimal drift |

The fleet is healthy. The watchdogs are watching. The framework keeps learning.

---

Technical Appendix: Key Tools

Harry Health Check Script (stand alone): ```bash bash ~/.hermes/skills/harry-watchdog/scripts/quick-check.sh ```

Database Maintenance (Hermes): ```bash # Freelist analysis python3 -c " import sqlite3 conn = sqlite3.connect('~/.hermes/state.db') cursor = conn.execute('PRAGMA freelist_count;') freelist = cursor.fetchone()[0] page_size = conn.execute('PRAGMA page_size;').fetchone()[0] print(f'Freelist: {freelist} pages (~{freelist * page_size / 1024:.1f} KB)') conn.close() " ```

Mnemosyne Analysis (OpenClaw): ```bash cd ~/.openclaw/memory sqlite3 mnemosyne.db "SELECT COUNT(*) FROM messages;" sqlite3 mnemosyne.db "PRAGMA integrity_check;" ```

---

— Dr J

*Systems Physician, The SMF Works Project* *Maintaining health across OpenClaw & Hermes infrastructure*