Progull replaces manual incident triage with a multi-agent system that observes, reasons, acts and learns — under strict policy and full audit.
Mainframe incident management is mostly people, pagers and PDFs. Tickets sit open while engineers extract logs, pattern-match error codes and decide what to restart.
Operators page subject-matter experts, hunt logs across SDSF and JES spool, and reassemble context before a single fix can begin.
Recovery depends on a small set of veterans who know the JCL, the DB2 quirks and the historical workarounds.
Every minute of a stuck overnight batch ripples into SLAs, downstream apps and missed business cut-offs.
Each agent owns one phase of the lifecycle and hands off through a typed, audit-logged contract.
Subscribes to JES spool, SYSLOG and OPERLOG events. Classifies abend codes the moment they appear and opens a ServiceNow incident with first-line context.
Fuses SYSUDUMP, JCL, recent change events, DB2 SQLCODE and historical resolutions into an explainable root-cause narrative.
Selects a policy-approved playbook — resubmit step, recycle CICS region, hold downstream job — and executes inside guardrails.
Validates recovery, attaches the full reasoning trail and evidence pack to the incident, and closes it in ServiceNow.
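As a rough illustration of the detection step and the typed hand-off described above, the sketch below classifies an abend from a single log line and returns an immutable record. It is a minimal sketch under assumptions, not Progull's implementation: the `Detection` record, the `classify` function and the category labels are hypothetical, and the pattern assumes the common `IEF450I jobname stepname - ABEND=Sxxx` message shape without a procstep field.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Well-known z/OS system completion codes; the category wording is illustrative.
ABEND_CATEGORIES = {
    "S0C4": "protection exception",
    "S0C7": "data exception",
    "S322": "CPU time limit exceeded",
    "S806": "load module not found",
    "SB37": "dataset out of space",
}

# Assumed message shape. Real SYSLOG records carry extra prefix fields,
# and IEF450I may include a procstep name, which this sketch ignores.
ABEND_RE = re.compile(
    r"IEF450I\s+(?P<job>\S+)\s+(?P<step>\S+).*?ABEND=(?P<code>S[0-9A-F]{3})"
)


@dataclass(frozen=True)
class Detection:
    """Hypothetical typed hand-off record passed to the next agent."""
    job: str
    step: str
    abend: str
    category: str


def classify(line: str) -> Optional[Detection]:
    """Return a Detection if the line reports a system abend, else None."""
    m = ABEND_RE.search(line)
    if m is None:
        return None
    code = m.group("code")
    category = ABEND_CATEGORIES.get(code, "unclassified")
    return Detection(m.group("job"), m.group("step"), code, category)
```

Freezing the record means a downstream agent cannot alter what the detector reported, which keeps the hand-off consistent with the audit trail.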
Every remediation maps to a versioned, reviewable playbook. Nothing executes outside the approved set.
Run agents in shadow, recommend, or auto-execute — per job class, per environment, per time window.
Inputs, intermediate thoughts and chosen actions are persisted with the incident for audit and learning.
Every incident traverses the same typed loop. Each transition is logged, signed, and replayable against the original evidence pack.
Subscribe to SYSLOG, JES spool and CICS events. Classify the abend within seconds.
Fuse SYSUDUMP, JCL, SQLCODE and recent change context into an explainable hypothesis.
Select a policy-approved playbook. Execute under a named surrogate ID with pre-flight checks.
Confirm RC=00, dataset shape and downstream invariants before declaring recovery.
Promote new patterns into KB candidates. SMEs approve before they become playbooks.
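One way to make each transition in the loop above logged, signed and replayable is an HMAC-signed chain in which every entry covers the signature of the previous one. The sketch below is an assumption about how such a trail could work, not a description of Progull's storage format; the key, the field names and the phase labels are all hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical demo key; a real deployment would load this from a secrets store.
SIGNING_KEY = b"demo-key-not-for-production"


def _sign(body: dict) -> str:
    """Sign a canonical JSON rendering of the entry body."""
    payload = json.dumps(body, sort_keys=True).encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def append_transition(trail: list, phase: str, evidence: dict) -> dict:
    """Append a signed entry that also covers the previous entry's signature."""
    prev_sig = trail[-1]["sig"] if trail else ""
    body = {"phase": phase, "evidence": evidence, "prev": prev_sig}
    entry = dict(body, sig=_sign(body))
    trail.append(entry)
    return entry


def verify(trail: list) -> bool:
    """Replay the chain and confirm no entry was altered, dropped or reordered."""
    prev_sig = ""
    for entry in trail:
        body = {k: entry[k] for k in ("phase", "evidence", "prev")}
        if entry["prev"] != prev_sig:
            return False
        if not hmac.compare_digest(entry["sig"], _sign(body)):
            return False
        prev_sig = entry["sig"]
    return True
```

Because each signature depends on the one before it, editing an early entry invalidates the whole tail of the chain, which is what lets an auditor replay the trail against the original evidence pack.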
A Progull playbook is not a free-form LLM instruction. It is a typed manifest your change board approves once and the agent executes the same way every time.
Abend code, job class, LPAR, time window and confidence threshold required to match.
Dataset state, downstream holds, change-freeze windows and dependent jobs verified before any action.
An ordered list of typed primitives — submit, hold, release, recycle — under a named surrogate ID.
RC, row counts, dump absence and CICS region health re-checked before declaring recovery.
If any check fails, the playbook stops, opens a Sev-2 worknote and pages the assignment group.
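The manifest parts listed above (pre-flight checks, ordered actions, post-checks and a failure path) suggest a simple fail-fast runner. The sketch below is hypothetical: the `Playbook` and `Outcome` types and the `run` function are illustrative names, and checks are modelled as plain callables rather than whatever assertion language the real manifest uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Check = Tuple[str, Callable[[], bool]]   # (label, predicate)
Action = Tuple[str, Callable[[], None]]  # (label, effect)


@dataclass
class Playbook:
    name: str
    preflight: List[Check]
    actions: List[Action]
    postcheck: List[Check]


@dataclass
class Outcome:
    status: str                      # "recovered" or "escalated"
    failed_check: str = ""           # label of the first failed assertion
    executed: List[str] = field(default_factory=list)


def run(pb: Playbook) -> Outcome:
    """Run a playbook, stopping at the first failed assertion."""
    # No action is attempted unless every pre-flight check passes.
    for label, ok in pb.preflight:
        if not ok():
            return Outcome("escalated", failed_check=label)
    executed: List[str] = []
    for label, effect in pb.actions:
        effect()
        executed.append(label)
    # Recovery is declared only after every post-check passes.
    for label, ok in pb.postcheck:
        if not ok():
            return Outcome("escalated", failed_check=label, executed=executed)
    return Outcome("recovered", executed=executed)
```

Returning the label of the failed check, rather than raising, gives the escalation path the exact assertion to record in the worknote.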
name: pb-mf-014-s0c7-payroll
match:
  abend: S0C7
  job_class: PAYROLL
  confidence_min: 0.85
preflight:
  - assert dataset(PAY.OUT).extents_remaining > 2
  - assert job(GLFEED01).state == WAITING
actions:
  - hold job: GLFEED01
  - submit job: PAYRUN02
    step: STEP0040
    parm: "CLEAN"
surrogate_id: PROGULL.PROD.PAY
postcheck:
  - assert step.RC == 0
  - assert dataset(PAY.OUT).row_count > 100000
on_failure:
  escalate: assignment_group=MF-PAYROLL-SRE

Pick the mode per environment, per job class and per time window. Promote forward only when your operators trust the trail.
Agents observe and produce a full reasoning trail. Zero action taken on z/OS. Best for week 1.
Agents draft the worknote and remediation plan inside ServiceNow. Operator clicks execute.
Agents execute approved playbooks; a named approver is paged for low-confidence or off-policy cases.
Agents detect, decide and act within policy. Humans review the trail; Sev-1s still page.
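Choosing a mode per environment, job class and time window can be pictured as a first-match rule lookup followed by a confidence gate. The sketch below is an assumption about how that could be wired, not Progull's configuration model; the labels `SUPERVISED` and `AUTONOMOUS` for the third and fourth modes, the `Rule` type and both functions are hypothetical.

```python
from dataclasses import dataclass
from datetime import time
from enum import IntEnum
from typing import List


class Mode(IntEnum):
    SHADOW = 0       # observe only
    RECOMMEND = 1    # draft a plan; the operator executes
    SUPERVISED = 2   # hypothetical label for the third mode
    AUTONOMOUS = 3   # hypothetical label for the fourth mode


@dataclass(frozen=True)
class Rule:
    environment: str
    job_class: str
    start: time
    end: time
    mode: Mode


def in_window(now: time, start: time, end: time) -> bool:
    """True if now falls in [start, end), including windows that wrap midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def resolve_mode(rules: List[Rule], environment: str, job_class: str, now: time) -> Mode:
    """First matching rule wins; with no match, fall back to the safest mode."""
    for r in rules:
        if (r.environment == environment and r.job_class == job_class
                and in_window(now, r.start, r.end)):
            return r.mode
    return Mode.SHADOW


def decide(mode: Mode, confidence: float, confidence_min: float) -> str:
    """Map a mode and a confidence score to observe, recommend or execute."""
    if mode == Mode.SHADOW:
        return "observe"
    if mode == Mode.RECOMMEND or confidence < confidence_min:
        return "recommend"
    return "execute"
```

Defaulting to `SHADOW` when no rule matches means a missing or mistyped rule can only make the agent more conservative.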
Failure handling is designed in. The agent never silently retries; every escalation lands in ServiceNow with the same trail your auditor consumes.
Below the configured confidence threshold, the agent halts at Recommend and pages the assignment group with the evidence pack. It never executes.
The playbook stops on the first failed assertion, records which assertion failed and opens a Sev-2 worknote for human triage.
If RC, row counts or region health do not return as expected, the agent does not declare recovery and re-opens the incident with a full diff.
If no approved playbook matches, the agent writes a recommendation only. New playbooks always require a human change request.
After N occurrences of the same abend in a window, the agent stops auto-resolving and escalates for SME review — the playbook is no longer the answer.
Operators can revoke an in-flight action via a single click in ServiceNow. The agent will not retry the same action without a fresh approval.
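The recurrence rule above (stop auto-resolving after N occurrences of the same abend in a window) resembles a sliding-window circuit breaker. The sketch below is one hypothetical reading of it, in which the first N occurrences inside the window may be auto-resolved and any further occurrence is escalated; the class and method names are illustrative.

```python
from collections import deque
from typing import Deque, Dict


class RecurrenceBreaker:
    """Sliding-window counter per abend code (hypothetical helper)."""

    def __init__(self, max_occurrences: int, window_seconds: float) -> None:
        self.max_occurrences = max_occurrences
        self.window_seconds = window_seconds
        self._seen: Dict[str, Deque[float]] = {}

    def allow_auto_resolve(self, abend: str, now: float) -> bool:
        """Record an occurrence and report whether auto-resolution is still allowed."""
        times = self._seen.setdefault(abend, deque())
        # Drop occurrences that have aged out of the window.
        while times and now - times[0] > self.window_seconds:
            times.popleft()
        times.append(now)
        return len(times) <= self.max_occurrences
```

Timestamps are passed in rather than read from the clock so the same incident history can be replayed deterministically.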