Automation that observes its own work

Most automation around a WhatsApp message stops at sending. This one reads the chat back to prove it landed.

The common playbook for automating WhatsApp messages is a templated send through Meta's Business API and an asynchronous delivery webhook minutes later. That contract is fine for transactional broadcasting and useless for an AI that has to decide, in the same turn, whether to retry. whatsapp-mcp-macos takes the opposite contract: paste the message, hit Return, then re-walk the accessibility tree of the WhatsApp Desktop window and return verified: true only when the rendered bubble actually contains what was just pasted.

This guide walks the verification loop line by line, then the three smaller protections around it that keep your clipboard, your cursor, and your modifier keys exactly where you left them. No Business API, no templates, no per-message fees, and no optimistic success.

Matthew Diakonov
11 min read
Traced against whatsapp-mcp-macos v1.1.0 on macOS 14
Line-level walk through handleSendMessage and pasteText
Sequence diagram of one send-and-verify round trip
Concrete protections for the human's clipboard, cursor, and modifier keys

Why a 200 OK is the wrong success signal for a chat

The common automation contract for outbound messaging is borrowed from email: hand the payload to a service, get a 200 back, and assume the rest is somebody else's problem. WhatsApp's Business API follows that shape, with a separate webhook firing later when Meta's infrastructure has decided the message is delivered. That split is fine when the sender is a backend job sending an order confirmation. It is hostile to a model that has to decide what to do next on the same turn.

On a Mac, you have a stronger signal available for free. The WhatsApp Desktop app renders a new bubble in your active chat within a few hundred milliseconds of the send, and that bubble is visible to the operating system's accessibility framework as an AXGenericElement whose description starts with the literal prefix 'Your message, '. If your automation knows how to walk that tree, it can read its own work back, character by character, before the tool call returns.

That is the contract whatsapp-mcp-macos chose. The rest of this guide is what it costs to honour it.

One send-and-verify round trip, by actor

Five participants, two AX traversals, no network. The model emits a tools/call frame, the host forwards it as JSON-RPC over stdio to the whatsapp-mcp child, the child talks to the macOS accessibility framework, and the framework drives the WhatsApp window. The reply travels back the same way carrying the verified field.

send-and-verify, one turn

Participants: model, host, mcp child, AX framework, WhatsApp app.

Messages, in order:
1. tools/call whatsapp_send_message
2. stdio JSON-RPC
3. AXUIElement traverse
4. find AXTextArea, click, paste
5. Return key posted
6. AX traverse again (post-send)
7. read AXGenericElement nodes
8. newest 'Your message, ...' bubble
9. compare to input, lowercase contains
10. {verified: true, to: 'Alex'}
11. tools/call result
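For orientation, the first two messages in the diagram can be sketched as a single line of JSON written to the child's stdin. This is a minimal sketch, assuming the standard MCP tools/call request shape sent as newline-delimited JSON-RPC 2.0. The helper build_tools_call and the argument key "message" are assumptions for illustration; the real argument schema is whatever the server's tool list declares.

```python
import json


def build_tools_call(request_id, tool, arguments):
    """Serialise one tools/call request as a single line of JSON-RPC 2.0."""
    frame = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # One JSON object per line: json.dumps escapes any newline inside strings,
    # so the only raw newline is the terminator appended here.
    return json.dumps(frame, separators=(",", ":")) + "\n"


# "message" is a hypothetical argument key, not taken from main.swift.
line = build_tools_call(1, "whatsapp_send_message", {"message": "on my way"})
print(line, end="")
```

The reply comes back on the child's stdout as another single-line JSON object carrying the same id, which is how the host pairs the verified field with the call that produced it.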

What flows where, and why nothing speaks HTTP to Meta

The hub is the MCP child. The arrows on the left are the things it consumes; the arrows on the right are the OS-level operations it performs. None of them is a network call. The verification edge is the one that distinguishes this stack from any API-driven automation of the same app.

inputs to the child, outputs from the child

Inputs: the model's tools/call, the active WhatsApp window, the human's input devices.
Hub: the whatsapp-mcp child.
Outputs: AX traversal pre-send, Cmd+V + Return, AX traversal post-send.

anchor fact

The entire verification loop lives in one function

handleSendMessage occupies lines 888 through 958 of Sources/WhatsAppMCP/main.swift. The verification half scans every AXGenericElement in the post-send tree, keeps the last description that begins with 'Your message, ', strips the time suffix with the regex ,\s+\d{1,2}:\d{2}\s*[APap][Mm], and returns true only when the lowercase readback contains what the model pasted (or vice versa, to absorb minor re-encoding).

The same file pins the reliability ceiling at line 120 with AXUIElementSetMessagingTimeout(appElement, 5.0). That is your error budget per AX call: not a network round trip, but a hard 5.0 second OS guarantee that the call either completes or fails cleanly.
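The comparison described above is small enough to sketch outside Swift. The following is an illustrative Python rendering, not the code in main.swift: the function name verify_readback, the sample descriptions, and the guard against empty strings are assumptions. Only the 'Your message, ' prefix, the time-suffix regex, and the two-way lowercase containment come from the text.

```python
import re
from typing import List, Optional, Tuple

PREFIX = "Your message, "
# Time suffix such as ", 3:14 PM" (pattern quoted from the text above).
TIME_SUFFIX = re.compile(r",\s+\d{1,2}:\d{2}\s*[APap][Mm]")


def verify_readback(descriptions: List[str], pasted: str) -> Tuple[bool, Optional[str]]:
    """Return (verified, last_outgoing_text) for a list of AX descriptions."""
    last = None
    for desc in descriptions:
        if desc.startswith(PREFIX):
            last = desc  # later nodes overwrite earlier ones: keep the newest
    if last is None:
        return False, None
    text = TIME_SUFFIX.sub("", last[len(PREFIX):]).strip()
    a, b = text.lower(), pasted.strip().lower()
    # Assumption: empty strings never verify, since "" is contained in everything.
    verified = bool(a) and bool(b) and (b in a or a in b)
    return verified, text


# Hypothetical descriptions; the real layout of each node may differ.
tree = [
    "Alex, hello, 3:10 PM",
    "Your message, see you soon, 3:12 PM",
    "Your message, On my way, 3:14 PM",
]
print(verify_readback(tree, "on my way"))     # → (True, 'On my way')
print(verify_readback(tree, "running late"))  # → (False, 'On my way')
```

Returning the last outgoing text alongside the boolean mirrors the warning field: when verification fails, the caller still learns what the server actually saw.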

The send loop, line by line

Three blocks. Refuse if no chat is open. Walk the AX tree, find the compose AXTextArea, click it, paste, and press Return. Then re-walk the tree, find the newest 'Your message, ' bubble, and compare. Each block fails closed: if any guard returns, the response carries success: false and the reason.

Sources/WhatsAppMCP/main.swift
5.0 s: AX messaging timeout per call
0.35 s: wait between paste and clipboard restore
1.0 s: settle window after pressing Return
A fixed maximum accessibility tree depth per traversal

Four protections that keep automation invisible to the human

The verification round trip is the headline. The other three are the things that make the headline tolerable to use on the machine you are also working on. Without them, every send would stomp the clipboard you just used, fling the cursor across two displays, and leave Cmd half-pressed in the OS modifier state.

Clipboard backup and restore

pasteText() reads NSPasteboard.general.string(forType: .string), saves it, sets the message, posts Cmd+V, then restores the original 0.35 seconds later. The link your friend just copied is still there when the bot is done.

Cursor position is restored

Before every clickAt(), the server calls saveCursorPosition() and after the click posts a CGEvent that puts the pointer back where the human left it. The mouse does not jump.

Sticky modifier clear

Around every keyboard event, key-up CGEvents are sent for keycodes 55, 56, 58, and 59 to flush Cmd, Shift, Option, and Control from the OS modifier state. Otherwise the next keystroke the human types could be read as part of a Cmd combo.

Post-send readback

After the Return key, a fresh accessibility traversal looks for the newest AXGenericElement whose description begins with 'Your message, '. The text is compared to the input. verified: true is only returned when the readback contains what was pasted.

How the clipboard and modifier discipline actually look

pasteText is twelve lines. Read clipboard, clear, set message, post Cmd+V, sleep 0.35 s, restore. The modifier flush lives in sendKeyEvent and runs around every flag-bearing keystroke so that the user's next character is not interpreted as part of the bot's last shortcut.
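The order of operations matters more than the platform calls, so it can be sketched with stand-ins. This is a minimal Python sketch, not the Swift in main.swift: FakePasteboard, flush_modifiers, and the events list are hypothetical substitutes for NSPasteboard and CGEvent posting, and the try/finally is one assumed way to guarantee the restore.

```python
import time


class FakePasteboard:
    """In-memory stand-in for the system clipboard (hypothetical)."""

    def __init__(self, text=None):
        self.text = text


MODIFIER_KEYCODES = (55, 56, 58, 59)  # the keycodes named in the text


def flush_modifiers(events):
    # Record one isolated key-up per modifier keycode.
    for code in MODIFIER_KEYCODES:
        events.append(("key_up", code))


def paste_text(board, message, events, settle=0.35, sleep=time.sleep):
    """Set the clipboard, record a Cmd+V, wait, then restore the old contents."""
    saved = board.text                         # 1. back up what the human copied
    try:
        board.text = message                   # 2. replace with the message
        flush_modifiers(events)                # 3. flush before the keystroke
        events.append(("cmd_v", board.text))   #    stand-in for posting Cmd+V
        flush_modifiers(events)                #    and flush again after it
        sleep(settle)                          # 4. give the paste time to land
    finally:
        board.text = saved                     # 5. restore even if a step raised


board, events = FakePasteboard("https://example.com/link"), []
paste_text(board, "on my way", events, sleep=lambda s: None)  # skip the real wait
print(board.text)   # → https://example.com/link
print(events[4])    # → ('cmd_v', 'on my way')
```

Passing sleep as a parameter keeps the 0.35 s settle visible in the signature while letting a test run without waiting.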

Sources/WhatsAppMCP/main.swift

The same call, success and failure

A foreground WhatsApp window returns verified: true within a second. The same call to a minimised window returns the same shape with verified: false, because the post-send readback cannot find a freshly rendered bubble. The model sees both cases as structured JSON, on the same turn, and decides what to do.

verified: true vs verified: false, same tool

The full four-step automation a model runs

Search, open by index, verify which chat the click landed on, then send and observe. Every step has a structured return that the next step depends on. The last step is the only write; the three before it are checks. This is what makes the contract safe to expose to a model that does not have eyes.

search, open, verify, send

1. Search for the recipient

whatsapp_search('alex') types into the sidebar, scrolls into view, and returns up to 200 chars per AXButton description, parsed into index, section, contactName, preview, and time. The search panel stays open so the next call can pick a result.

2. Open the right chat by index, not by name

whatsapp_open_chat(index: 0) clicks the Nth result. Names alone are unsafe because two contacts can share a first name. The index is the position in the parsed sidebar list.

3. Verify which chat actually opened

whatsapp_get_active_chat() reads the heading panel at x > 1750 and parses last-seen, online, and typing prefixes off the front of the heading string. The model only proceeds if the name matches who it intended to message.

4. Send and observe

whatsapp_send_message('on my way') runs the loop above. The call returns verified: true only when the post-send AX read contains the pasted text. A verified: false result is the loud failure the rest of automation usually hides.
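The four steps above can be sketched as one guarded function on the caller's side. This is an illustration under stated assumptions: call_tool is a hypothetical function that sends one tools/call and returns the parsed result, and the argument keys "query", "index", and "message", the "results" list, and the "name" field on the active chat are assumed. The tool names and the contactName and index fields come from the text.

```python
from typing import Any, Callable, Dict

ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def send_verified(call_tool: ToolCaller, query: str, expected_name: str,
                  message: str) -> Dict[str, Any]:
    """Search, open by index, confirm the active chat, then send."""
    results = call_tool("whatsapp_search", {"query": query}).get("results", [])
    matches = [r for r in results if r.get("contactName") == expected_name]
    if len(matches) != 1:
        # Zero or several contacts share the name: refuse rather than guess.
        return {"success": False,
                "reason": f"{len(matches)} results named {expected_name!r}"}

    call_tool("whatsapp_open_chat", {"index": matches[0]["index"]})

    active = call_tool("whatsapp_get_active_chat", {})
    if active.get("name") != expected_name:
        return {"success": False,
                "reason": "active chat does not match intended recipient"}

    # The only write; everything above was a check.
    return call_tool("whatsapp_send_message", {"message": message})


def fake_call_tool(name, args):
    """Canned responses standing in for the real server (hypothetical)."""
    canned = {
        "whatsapp_search": {"results": [
            {"index": 0, "section": "chats", "contactName": "Alex"}]},
        "whatsapp_open_chat": {"success": True},
        "whatsapp_get_active_chat": {"name": "Alex"},
        "whatsapp_send_message": {"success": True, "verified": True, "to": "Alex"},
    }
    return canned[name]


print(send_verified(fake_call_tool, "alex", "Alex", "on my way")["verified"])  # → True
```

Each early return carries success: false and a reason, matching the fail-closed shape the server itself uses.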

Side by side: API automation vs verified-local automation

Same problem, different success signal. The Business API path optimises for asynchronous broadcast. The local accessibility path optimises for synchronous, observable delivery on a single machine. Pick the one whose failure mode matches your use case.

verified: true on readback; AXGenericElement scan; 'Your message, ' prefix; regex strips ', 3:14 PM'; lowercase contains both ways; cursor restored after click; clipboard restored after paste; sticky modifiers flushed; 5.0s AX hard ceiling; no Meta Business API; no template approval; no 24-hour window
Property: Where 'success' comes from
Business API automation: A 200 OK from the Business API and, eventually, a delivery webhook from Meta. The sender never reads what landed.
whatsapp-mcp-macos: A second AX traversal that finds the newest 'Your message, ' bubble in the rendered chat. The sender reads back exactly what landed.

Property: What 'failure' looks like
Business API automation: Template rejection, rate limit, account pause, or a delivery webhook that arrives minutes later, often after the user has moved on.
whatsapp-mcp-macos: verified: false in the same response, before the model's turn ends, with the actual last sent message echoed in the warning field.

Property: Effect on the human's machine
Business API automation: None. The send happens on Meta's servers; nothing is written to the user's clipboard or focus state.
whatsapp-mcp-macos: Clipboard is touched briefly (paste) then restored. Cursor is moved to click, then put back. Modifier-key state is flushed. The user does not feel it.

Property: Sender identity
Business API automation: A registered business number Meta has approved.
whatsapp-mcp-macos: The human's own WhatsApp account, signed in to the Desktop app on their Mac.

Property: Per-message cost
Business API automation: Per-conversation pricing set by Meta and varying by country and category.
whatsapp-mcp-macos: Zero. The OS does not charge per AXUIElementCopyAttributeValue.

Property: Round-trip latency
Business API automation: Network-bound, plus a separate webhook for delivery state.
whatsapp-mcp-macos: Bound by two AX traversals plus a 1.0 s settle delay. Verification is in-band with the send, not asynchronous.

What a model should do when verified is false

Do not retry blindly. A duplicate WhatsApp message is worse than a slow render. The right behaviour is to call whatsapp_get_active_chat again and inspect recentMessages: if the new text now appears, the send did land and the only problem was that the AX tree had not stabilised when the post-send traversal ran. If after a second check the message is still absent, only then retry the send. The warning field on the original response carries the last 'Your message, ' bubble the server actually saw, which is usually the previous message and a useful diagnostic for the model's reasoning.
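That policy can be sketched as a small decision function on the caller's side. This is a hedged illustration, not part of the server: call_tool is a hypothetical function that issues one tool call and returns the parsed result, the argument key "message" is assumed, and recentMessages is treated as a list of strings because the text does not specify its element shape.

```python
import time


def handle_send_result(call_tool, message, response, wait=1.0, sleep=time.sleep):
    """Classify a send response, re-checking the chat before any retry."""
    if not response.get("success"):
        return "failed"
    if response.get("verified"):
        return "verified"

    sleep(wait)  # give the AX tree time to stabilise before looking again
    active = call_tool("whatsapp_get_active_chat", {})
    recent = [str(m).lower() for m in active.get("recentMessages", [])]
    if any(message.lower() in m for m in recent):
        return "landed_late"  # the send worked; only the readback was early

    # Still absent after the second check: one retry is now justified.
    retry = call_tool("whatsapp_send_message", {"message": message})
    return "retried_verified" if retry.get("verified") else "retried_unverified"


def fake_call_tool(name, args):
    """Canned server where the bubble shows up on the second look (hypothetical)."""
    if name == "whatsapp_get_active_chat":
        return {"name": "Alex", "recentMessages": ["see you soon", "On my way"]}
    return {"success": True, "verified": True}


first = {"success": True, "verified": False, "warning": "Could not verify"}
print(handle_send_result(fake_call_tool, "on my way", first,
                         sleep=lambda s: None))  # → landed_late
```

The function retries at most once, so a slow render can never turn into a stream of duplicate messages.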

The reason this is even possible is that verification is part of the same tool call instead of a separate webhook. By the time the model is reading the response, the failure is already shaped like a structured object the next tool call can act on.

Frequently asked questions

What does 'verified' actually mean for an automated WhatsApp message?

It means a second pass over the macOS accessibility tree found a rendered message bubble whose text contains what the automation just pasted. Concretely, after the Return key event is posted, handleSendMessage in Sources/WhatsAppMCP/main.swift waits 1.0 second, calls traverseAXTree(pid:) again, and iterates every AXGenericElement that has a description starting with 'Your message, '. For each match it strips the trailing time pattern (regex ,\s+\d{1,2}:\d{2}\s*[APap][Mm]) and lowercases. The tool returns verified: true only when one of those readbacks contains the input or the input contains the readback. If the AX tree has no such node, verified is false and the response includes a warning field with whatever the last 'Your message, ' bubble actually said. Verification is part of the call, not a separate webhook, which is why the model can decide to retry on the same turn rather than queueing a follow-up.

Why does the automation back up and restore the clipboard around every send?

Because input on macOS Catalyst apps is most reliable through Cmd+V, and the only way to deliver the message text to the compose field without losing emoji or non-ASCII characters is to put it on the clipboard first. pasteText() at lines 225 to 237 of main.swift starts by reading NSPasteboard.general.string(forType: .string) and storing the result, then clears the pasteboard, sets the message, posts a Cmd+V keyboard event, sleeps 0.35 seconds for the paste to land, and restores the original clipboard contents. The window in which a human's clipboard is overwritten is at most a few hundred milliseconds, and the original contents are put back even if the send itself fails. Without this discipline, every send would silently overwrite whatever the user had just copied, and that is the kind of bug that makes people uninstall an automation tool the second time it bites them.

What protects the human's mouse cursor and modifier-key state during a send?

Two specific helpers in main.swift. saveCursorPosition() at line 195 reads NSEvent.mouseLocation, accounts for the screen-Y flip on the primary NSScreen, and returns a CGPoint. clickAt() saves that point before posting the mouse-down/mouse-up events and then posts a CGEvent of type mouseMoved back to the saved point as the last act of the click. The cursor returns to the place the user left it. For modifier keys, sendKeyEvent() flushes keycodes 55, 56, 58, and 59 (Cmd, Shift, Option, and Control) with isolated key-up events both before and after any flag-bearing event. Without that flush, a Cmd+V posted by the automation could leave the OS believing Cmd is still down, and the human's next keystroke would be misinterpreted as part of a shortcut. These are not features. They are the cost of sharing the same physical input devices with a human user.

Is this WhatsApp automation tied to the Business API or any Meta-side approval?

No. The server has no HTTP client targeting Meta. It binds to the WhatsApp Desktop macOS Catalyst app by its bundle id net.whatsapp.WhatsApp (line 57 of main.swift), launches it via /usr/bin/open if it is not already running, and drives it through AXUIElement calls. There is no API key, no access token, no template list, and no 24-hour messaging window. The env block in the MCP host configuration is literally an empty object. The only authority the server needs is the operating system's Accessibility permission, granted to the host process (Claude Code, Cursor, or whichever client forks it as a stdio child) in System Settings, Privacy and Security, Accessibility.

Where is the 5.0 second ceiling, and why does it matter for automation reliability?

AXUIElementSetMessagingTimeout(appElement, 5.0) is called on the WhatsApp application AX element at line 120 of main.swift, immediately after AXUIElementCreateApplication(pid). That timeout governs every accessibility call that follows: requesting children, fetching attributes, posting clicks. If WhatsApp Desktop fails to respond within 5.0 seconds, the call errors out cleanly rather than blocking the model's turn. The number is hardcoded; there is no flag to lengthen it. In practice a healthy WhatsApp window responds in tens of milliseconds, so the ceiling only trips when the window is minimised, the OS is under heavy load, or accessibility was just regranted and the TCC database is stale. When you see verified: false despite a successful paste, the first thing to check is whether one of the AX traversals timed out; the stderr log from the child will tell you.

Can the automation reach a WhatsApp number I have never messaged before?

No. There is no whatsapp_create_chat tool in the surface. The eleven exposed tools are status, start, quit, get_active_chat, list_chats, search, open_chat, scroll_search, read_messages, send_message, and navigate (assembled into the allTools array at line 1110 of main.swift). search returns indexed buttons that are already in the sidebar's chats and contacts sections; open_chat clicks one of them. If the recipient is not in your sidebar, the agent sees nothing to click. This scoping is intentional: it means the automation cannot perform a cold outreach on your behalf, which keeps the tool aligned with personal use rather than broadcast.

What does a verified: false response look like, and what should the caller do with it?

It looks like a JSON object of shape {success: true, verified: false, to: "Alex", message: "...", warning: "Could not verify message appeared in chat. Last sent message: ..."}. success is true because the paste and Return key both succeeded; verified is false because the post-send AX traversal did not find a matching 'Your message, ' bubble. The warning includes the most recent outgoing message text the server did find, which is often the previous one if the new message has not rendered yet. The right caller behaviour is to wait briefly and call whatsapp_get_active_chat again to read the latest messages; if the new text now appears, the send did land. If after a second check the text is still absent, retry the send. Do not retry blindly on the first false: a duplicate send is worse than a slow render.

How does this compare with browser-based automation of WhatsApp Web?

Browser automation of WhatsApp Web has a documented bad reputation for this exact use case because the message input is a contenteditable element rather than a textarea, so the usual Playwright or Selenium type() helpers do not deliver characters reliably; emoji and non-Latin scripts are particularly brittle. Focus management is fragile, paste events on contenteditable behave differently than on text inputs, and the page is a heavy SPA that re-renders on every websocket frame, which makes any selector-based check race against the rerender. The macOS Catalyst path bypasses all of that by talking to the same accessibility framework that VoiceOver and other assistive tech use; the AX tree settles after each user-visible change, so the post-send readback observes a stable state. The instructions block at the bottom of main.swift explicitly warns the model not to fall back to browser automation if accessibility fails, because the failure mode there is silent corruption rather than loud refusal.