The Solution

One agentic loop, from abend to closure.

Progull replaces manual incident triage with a multi-agent system that observes, reasons, acts and learns — under strict policy and full audit.

The problem

Batch abends still run on human reflexes.

Mainframe incident management is mostly people, pagers and PDFs. Tickets sit open while engineers extract logs, pattern-match error codes and decide what to restart.

Manual triage at 3am

Operators page subject-matter experts, hunt through logs in SDSF and the JES spool, and reassemble context before a single fix can begin.

Tribal knowledge bottlenecks

Recovery depends on a small set of veterans who know the JCL, the DB2 quirks and the historical workarounds.

Recovery measured in hours

Every minute of a stuck overnight batch ripples into SLAs, downstream apps and missed business cut-offs.

The four agents

Four agents, each with a single job.

Each agent owns one phase of the lifecycle and hands off through a typed, audit-logged contract.

Detection agent

Subscribes to JES spool, SYSLOG and OPERLOG events. Classifies abend codes the moment they appear and opens a ServiceNow incident with first-line context.
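
As a rough illustration of the classification step, the sketch below pulls an abend code out of a job-termination message and assigns it a category. The line layout follows the general shape of an IEF450I message, but the regular expression, the `classify_abend` name and the small category table are assumptions made for this example, not Progull's actual parser.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed shape of a job-termination line; real spool output varies by site.
ABEND_RE = re.compile(
    r"IEF450I\s+(?P<job>\S+)\s+(?P<step>.+?)\s+-\s+"
    r"ABEND=(?P<sys>S[0-9A-F]{3})\s+(?P<user>U\d{4})"
)

# Illustrative categories only; a production table would be far larger.
CATEGORIES = {
    "S0C7": "data exception",
    "S0C4": "protection exception",
    "S322": "CPU time limit exceeded",
    "SB37": "out of space",
    "S806": "module not found",
}


@dataclass(frozen=True)
class Abend:
    job: str
    step: str
    code: str
    category: str


def classify_abend(line: str) -> Optional[Abend]:
    """Return the parsed abend, or None if the line is not an abend message."""
    m = ABEND_RE.search(line)
    if m is None:
        return None
    sys_code, user_code = m["sys"], m["user"]
    # A system code of S000 means the failure was a user abend (Unnnn).
    code = user_code if sys_code == "S000" else sys_code
    return Abend(m["job"], m["step"], code, CATEGORIES.get(code, "unclassified"))
```

Codes missing from the table fall through as "unclassified" rather than being guessed, which keeps the first-line context honest about what is known.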

Reasoning agent

Fuses SYSUDUMP, JCL, recent change events, DB2 SQLCODE and historical resolutions into an explainable root-cause narrative.

Remediation agent

Selects a policy-approved playbook — resubmit step, recycle CICS region, hold downstream job — and executes inside guardrails.

Closure agent

Validates recovery, attaches the full reasoning trail and evidence pack to the incident, and closes it in ServiceNow.
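
One way to picture a typed, audit-logged hand-off between the four agents is sketched below. The `Handoff` record, the `AuditLog` class and the field names are hypothetical; the sketch assumes only that each agent passes the incident to the next phase in a fixed order and that every hand-off is recorded.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List

# The fixed order in which the four agents pass an incident along.
PIPELINE = ["detection", "reasoning", "remediation", "closure"]


@dataclass(frozen=True)
class Handoff:
    incident_id: str
    from_agent: str
    to_agent: str
    payload: dict


class AuditLog:
    """Append-only list of serialized hand-offs (in-memory for the sketch)."""

    def __init__(self) -> None:
        self.entries: List[str] = []

    def record(self, handoff: Handoff) -> None:
        entry = {"at": datetime.now(timezone.utc).isoformat(), **asdict(handoff)}
        self.entries.append(json.dumps(entry, sort_keys=True))


def hand_off(log: AuditLog, incident_id: str, from_agent: str, payload: dict) -> Handoff:
    """Build the hand-off to the next phase and log it before returning."""
    idx = PIPELINE.index(from_agent)  # raises ValueError for an unknown agent
    if idx == len(PIPELINE) - 1:
        raise ValueError("closure is the final phase; nothing to hand off to")
    handoff = Handoff(incident_id, from_agent, PIPELINE[idx + 1], payload)
    log.record(handoff)
    return handoff
```

Because the receiving agent is derived from the pipeline order rather than supplied by the caller, an agent in this sketch cannot skip a phase.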

Guardrails

Autonomy with an off-switch on every step.

Guardrail

Policy-as-code

Every remediation maps to a versioned, reviewable playbook. Nothing executes outside the approved set.
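
The "nothing executes outside the approved set" rule can be sketched as a lookup that every remediation must pass before it runs. The `ApprovedSet` and `PlaybookRef` names are hypothetical, and the sketch assumes that exactly one version of each playbook is approved at a time.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PlaybookRef:
    playbook_id: str
    version: int


class PolicyViolation(Exception):
    """Raised when a remediation is not backed by an approved playbook."""


class ApprovedSet:
    def __init__(self) -> None:
        self._approved: Dict[str, int] = {}

    def approve(self, ref: PlaybookRef) -> None:
        # Approving a new version replaces the previously approved one.
        self._approved[ref.playbook_id] = ref.version

    def require(self, ref: PlaybookRef) -> None:
        """Raise unless this exact playbook version is approved."""
        approved_version = self._approved.get(ref.playbook_id)
        if approved_version is None:
            raise PolicyViolation(f"{ref.playbook_id} is not an approved playbook")
        if approved_version != ref.version:
            raise PolicyViolation(
                f"{ref.playbook_id} v{ref.version} requested, "
                f"but v{approved_version} is the approved version"
            )
```

Calling `require` immediately before execution means an edited or outdated manifest fails closed instead of running.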

Human-in-the-loop modes

Run agents in shadow, recommend, approve or autonomous mode, set per job class, per environment and per time window.

Full reasoning trace

Inputs, intermediate thoughts and chosen actions are persisted with the incident for audit and learning.

The reasoning loop

Observe → Reason → Act → Verify → Learn.

Every incident traverses the same typed loop. Each transition is logged, signed, and replayable against the original evidence pack.

  1. STEP 01

    Observe

    Subscribe to SYSLOG, JES spool and CICS events. Classify the abend within seconds.

  2. STEP 02

    Reason

    Fuse SYSUDUMP, JCL, SQLCODE and recent change context into an explainable hypothesis.

  3. STEP 03

    Act

    Select a policy-approved playbook. Execute under a named surrogate ID with pre-flight checks.

  4. STEP 04

    Verify

    Confirm RC=00, dataset shape and downstream invariants before declaring recovery.

  5. STEP 05

    Learn

    Promote new patterns into KB candidates. SMEs approve before they become playbooks.
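
A minimal sketch of how the five-step loop could be kept ordered and tamper-evident is shown below. It assumes an HMAC-SHA256 signature chained over each transition, which is one common way to sign a log; the source does not say which signing scheme Progull uses, and `LoopTrail` and its methods are hypothetical names.

```python
import hashlib
import hmac
import json
from enum import Enum
from typing import List


class Phase(Enum):
    OBSERVE = "observe"
    REASON = "reason"
    ACT = "act"
    VERIFY = "verify"
    LEARN = "learn"


ORDER = list(Phase)  # Enum iteration preserves definition order


class LoopTrail:
    def __init__(self, key: bytes) -> None:
        self._key = key
        self.records: List[dict] = []

    def _sign(self, body: dict, prev_sig: str) -> str:
        # Chaining in the previous signature ties each record to its history.
        msg = prev_sig.encode() + json.dumps(body, sort_keys=True).encode()
        return hmac.new(self._key, msg, hashlib.sha256).hexdigest()

    def advance(self, phase: Phase, evidence: dict) -> None:
        """Append the next transition; reject anything out of order."""
        step = len(self.records)
        if step >= len(ORDER) or ORDER[step] is not phase:
            raise ValueError(f"unexpected transition to {phase.value}")
        prev_sig = self.records[-1]["sig"] if self.records else ""
        body = {"phase": phase.value, "evidence": evidence}
        self.records.append({**body, "sig": self._sign(body, prev_sig)})

    def verify(self) -> bool:
        """Re-derive every signature from the stored evidence."""
        prev_sig = ""
        for rec in self.records:
            body = {"phase": rec["phase"], "evidence": rec["evidence"]}
            if not hmac.compare_digest(rec["sig"], self._sign(body, prev_sig)):
                return False
            prev_sig = rec["sig"]
        return True
```

Re-running `verify` against the stored evidence is a simplified stand-in for replay: any record altered after the fact no longer matches its signature.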

Playbook anatomy

Every remediation is versioned, reviewable code.

A Progull playbook is not a free-form LLM instruction. It is a typed manifest your change board approves once and the agent executes the same way every time.

Trigger conditions

The abend code, job class, LPAR, time window and minimum confidence that must all match before the playbook applies.

Pre-flight checks

Dataset state, downstream holds, change-freeze windows and dependent jobs verified before any action.

Action set

An ordered list of typed primitives — submit, hold, release, recycle — under a named surrogate ID.

Post-conditions

RC, row counts, dump absence and CICS region health re-checked before declaring recovery.

Failure escalation

If any check fails the playbook stops, opens a Sev-2 worknote and pages the assignment group.
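
The five parts above can be sketched as a small executor that stops on the first failed check. This is an illustration under stated assumptions: the `Playbook` and `Outcome` types and the `run_playbook` function are hypothetical, checks and actions are plain callables, and escalation is reduced to recording the target assignment group rather than opening a worknote or paging anyone.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Check = Tuple[str, Callable[[], bool]]   # (description, assertion)
Action = Tuple[str, Callable[[], None]]  # (description, primitive)


@dataclass
class Playbook:
    name: str
    preflight: List[Check]
    actions: List[Action]
    postcheck: List[Check]
    escalate_to: str


@dataclass
class Outcome:
    recovered: bool = False
    failed_check: str = ""
    escalated_to: str = ""
    executed: List[str] = field(default_factory=list)


def run_playbook(pb: Playbook) -> Outcome:
    out = Outcome()
    # Pre-flight: no action runs unless every assertion holds.
    for desc, check in pb.preflight:
        if not check():
            out.failed_check, out.escalated_to = desc, pb.escalate_to
            return out
    # Actions run in the order the manifest lists them.
    for desc, act in pb.actions:
        act()
        out.executed.append(desc)
    # Post-conditions: recovery is declared only if all of them pass.
    for desc, check in pb.postcheck:
        if not check():
            out.failed_check, out.escalated_to = desc, pb.escalate_to
            return out
    out.recovered = True
    return out
```

The returned `Outcome` records which assertion failed and what had already executed, which is the information a human needs when picking up the escalation.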

playbook · PB-MF-014 · APPROVED · v3
name: pb-mf-014-s0c7-payroll
match:
  abend: S0C7
  job_class: PAYROLL
  confidence_min: 0.85
preflight:
  - assert dataset(PAY.OUT).extents_remaining > 2
  - assert job(GLFEED01).state == WAITING
actions:
  - hold     job: GLFEED01
  - submit   job: PAYRUN02
             step: STEP0040
             parm: "CLEAN"
             surrogate_id: PROGULL.PROD.PAY
postcheck:
  - assert step.RC == 0
  - assert dataset(PAY.OUT).row_count > 100000
on_failure:
  escalate: assignment_group=MF-PAYROLL-SRE

Operating modes

Autonomy is a dial, not a switch.

Pick the mode per environment, per job class and per time window. Promote forward only when your operators trust the trail.

MODE 01

Shadow

Agents observe and produce a full reasoning trail. Zero action taken on z/OS. Best for week 1.

MODE 02

Recommend

Agents draft the worknote and remediation plan inside ServiceNow. The operator clicks to execute.

MODE 03

Approve

Agents execute approved playbooks; a named approver is paged for low-confidence or off-policy cases.

MODE 04

Autonomous

Agents detect, decide and act within policy. Humans review the trail; Sev-1s still page.
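
Choosing a mode per environment, job class and time window could look like the sketch below. The rule shape and the two tie-breaking choices are assumptions for illustration: when several rules match, the most conservative mode wins, and when none match, the agent falls back to Shadow. `ModeRule` and `resolve_mode` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import time
from enum import IntEnum
from typing import List


class Mode(IntEnum):
    SHADOW = 1
    RECOMMEND = 2
    APPROVE = 3
    AUTONOMOUS = 4


@dataclass(frozen=True)
class ModeRule:
    environment: str
    job_class: str
    start: time  # window start, inclusive
    end: time    # window end, exclusive
    mode: Mode

    def covers(self, environment: str, job_class: str, at: time) -> bool:
        if (environment, job_class) != (self.environment, self.job_class):
            return False
        if self.start <= self.end:
            return self.start <= at < self.end
        # Window wraps past midnight, e.g. 22:00 to 06:00.
        return at >= self.start or at < self.end


def resolve_mode(rules: List[ModeRule], environment: str, job_class: str, at: time) -> Mode:
    """Pick the lowest-autonomy mode among matching rules; default to Shadow."""
    matches = [r.mode for r in rules if r.covers(environment, job_class, at)]
    return min(matches, default=Mode.SHADOW)
```

Defaulting to Shadow means a job class nobody has configured is observed but never acted on.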

Failure modes

What happens when the agent is wrong.

Failure handling is designed in. The agent never silently retries; every escalation lands in ServiceNow with the same trail your auditor consumes.

Low-confidence diagnosis

Below the configured confidence threshold, the agent halts at Recommend, pages the assignment group with the evidence pack, and does not execute.

Pre-flight check fails

The playbook stops on the first failed assertion, records which assertion failed and opens a Sev-2 worknote for human triage.

Post-condition fails

If RC, row counts or region health do not return as expected, the agent does not declare recovery and re-opens the incident with a full diff of expected versus observed values.

Policy not matched

If no approved playbook matches, the agent writes a recommendation only. New playbooks always require a human change request.

Repeated abend pattern

After N occurrences of the same abend within a configured time window, the agent stops auto-resolving and escalates for SME review, because the playbook is no longer the answer.

Operator override

Operators can revoke an in-flight action via a single click in ServiceNow. The agent will not retry the same action without a fresh approval.
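
Three of the gates above (low confidence, no matching playbook, repeated abend) can be combined into a single decision function, sketched here. The names `RepeatGuard` and `decide` are hypothetical, and the sketch reads "after N occurrences" as allowing N auto-resolutions inside the window before escalating; the source does not specify that boundary.

```python
from collections import deque
from enum import Enum
from typing import Deque, Dict, Optional


class Decision(Enum):
    EXECUTE = "execute"
    RECOMMEND_ONLY = "recommend_only"
    ESCALATE_SME = "escalate_sme"


class RepeatGuard:
    """Sliding-window counter of occurrences per abend code."""

    def __init__(self, max_occurrences: int, window_seconds: float) -> None:
        self.max_occurrences = max_occurrences
        self.window_seconds = window_seconds
        self._seen: Dict[str, Deque[float]] = {}

    def record(self, abend: str, now: float) -> bool:
        """Record an occurrence; return True once the limit is exceeded."""
        times = self._seen.setdefault(abend, deque())
        times.append(now)
        # Drop occurrences that have aged out of the window.
        while now - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_occurrences


def decide(confidence: float, confidence_min: float,
           playbook_id: Optional[str], repeated: bool) -> Decision:
    if repeated:
        return Decision.ESCALATE_SME
    if playbook_id is None or confidence < confidence_min:
        return Decision.RECOMMEND_ONLY
    return Decision.EXECUTE
```

The repeat check runs first so that a recurring abend reaches an SME even when a high-confidence playbook match exists.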