Three trials are reported here. The first replicates the canonical 1964 stylometric attribution of the disputed Federalist papers. The second is an adversarial test of whether the analyzer falsely attributes AI-generated text to a human candidate, the failure mode an academic-integrity office most fears. The third tests closed-set discrimination across five distinct candidates, which is closer to a realistic caseload than the two-candidate setup of the first two trials.
In 1964, Mosteller and Wallace used statistical analysis of function words to attribute the historically disputed Federalist papers to James Madison. We ran our own analyzer on the same problem. It agreed with their conclusion in thirty seconds, with verifiable stylistic evidence.
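To make the function-word idea concrete, here is a minimal sketch that counts how often a few marker words occur per 1,000 words. It is an illustration only: it is neither the Bayesian model Mosteller and Wallace used nor a description of how the analyzer works, and the marker list (`MARKERS`) and function names are assumptions chosen for the example. Words such as "upon" and "whilst" are among those commonly cited as separating Hamilton from Madison.

```javascript
// Illustrative only: a marker-word rate counter, not the Mosteller-Wallace
// model and not the analyzer's method. All names here are hypothetical.

// Split text into lowercase word tokens (ASCII letters and apostrophes).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z']+/g) ?? [];
}

// Occurrences of each marker word per 1,000 tokens.
function ratesPerThousand(text, markers) {
  const tokens = tokenize(text);
  const counts = new Map(markers.map((w) => [w, 0]));
  for (const t of tokens) {
    if (counts.has(t)) counts.set(t, counts.get(t) + 1);
  }
  const rates = {};
  for (const w of markers) {
    rates[w] = tokens.length === 0 ? 0 : (counts.get(w) / tokens.length) * 1000;
  }
  return rates;
}

// Hypothetical marker list for the example.
const MARKERS = ["upon", "whilst", "while"];

const sample = "Upon the whole, it rests upon the people, while others wait.";
console.log(ratesPerThousand(sample, MARKERS));
```

Comparing these rates for a target against the rates for each candidate sample gives a crude but checkable signal of the kind the 1964 study formalized.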
Two candidate samples were supplied to the analyzer: Federalist No. 1 (Hamilton) and Federalist No. 10 (Madison). Four target documents were attributed in sequence: two known-authorship controls (No. 6 and No. 14), and two of the historically disputed papers (No. 49 and No. 51). The analyzer received only the bare text of the candidates and target — not which papers it was looking at, nor what the historical consensus was.
Federalist No. 6
Sanity check. Known Hamilton paper held out as target.
Federalist No. 14
Sanity check. Known Madison paper held out as target.
Federalist No. 49
Historically disputed cluster (Nos. 49–58, 62–63). Mosteller–Wallace consensus: Madison.
Federalist No. 51
Historically disputed. Contains the famous line, “If men were angels, no government would be necessary.”
The analyzer does not just return a verdict. It returns the specific stylistic features it relied on — every one of which is independently verifiable in the text.
The most consequential failure mode for an academic-integrity tool is the false positive: confidently naming a human candidate as the author of text that was actually generated by a language model. We tested for it directly. The analyzer correctly hedged to “inconclusive” on a generic AI essay, and flagged stylistic mimicry as a concern in the case of an AI essay deliberately written to imitate one of the candidates.
Two stylistically distinct human candidates were supplied as samples: Mark Twain (the opening of The Innocents Abroad) and Henry James (the opening of The Turn of the Screw). Three targets were then attributed: a held-out passage of real Twain as a control, an LLM-generated travel essay with no instruction to imitate any author, and an LLM-generated essay explicitly prompted to imitate Twain's voice. Both AI targets were generated by Claude Sonnet 4.6.
Sanity check. A different excerpt of Innocents Abroad than the candidate sample.
An LLM-generated essay on a similar topic, without instruction to imitate any author. The analyzer correctly returned “inconclusive,” assigning a probability of 0.55 to the hypothesis that neither candidate wrote it.
An LLM-generated essay explicitly prompted to mimic Twain's voice. The analyzer attributed it to Twain, but its first caveat read: “The target document may be a deliberate pastiche.”
On the pure AI essay, the analyzer not only declined to attribute the text to either candidate; it also characterized the kind of writer the text actually resembles.
“The target document is a contemporary travel essay written in a meditative, essayistic first-person voice characterized by short aphoristic observations, philosophical abstraction about perception and consciousness, and a quietly ironic but non-comic register. Neither Mark Twain nor Henry James is a plausible match… The evidence most strongly supports attribution to a writer outside this candidate set — a contemporary essayist influenced perhaps by writers such as Geoff Dyer, Teju Cole, or Zadie Smith in their essay mode.”
On the AI essay deliberately imitating Twain, the analyzer did attribute the text to Twain — but its very first caveat raised exactly the concern a forensic reviewer needs to see.
“The target document may be a deliberate pastiche of Twain's Innocents Abroad style; stylometric analysis cannot reliably distinguish skilled imitation from authentic authorship.”
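One way to picture how a ranked output becomes a verdict like “inconclusive” is a simple threshold rule over hypothesis probabilities. The sketch below is hypothetical: the `decideVerdict` name, the 0.6 threshold, and the explicit `neither` hypothesis are assumptions for illustration, not the analyzer's actual logic. In the example, only the 0.55 for `neither` comes from the trial; the two candidate probabilities are invented so the distribution sums to 1.

```javascript
// Hypothetical decision rule. The threshold and the "neither" hypothesis
// name are assumptions, not the analyzer's real behaviour.
function decideVerdict(probabilities, threshold = 0.6) {
  // Rank hypotheses from most to least probable.
  const ranked = Object.entries(probabilities).sort((a, b) => b[1] - a[1]);
  if (ranked.length === 0) return { verdict: "inconclusive", ranked };

  const [top, p] = ranked[0];
  // Decline to name a candidate when "neither" leads or confidence is low.
  if (top === "neither" || p < threshold) {
    return { verdict: "inconclusive", ranked };
  }
  return { verdict: top, ranked };
}

// 0.55 for "neither" is from the trial; 0.25 and 0.20 are made up.
const generic = decideVerdict({ twain: 0.25, james: 0.2, neither: 0.55 });
console.log(generic.verdict); // "inconclusive"
```

A rule like this cannot catch the pastiche case, because a skilled imitation can score high for the imitated candidate. That is why the caveat has to travel with the attribution.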
Real-world investigations rarely come with two suspects. An academic integrity office may have a class of five; an HR inquiry may have everyone in a Slack channel. We tested whether the analyzer could pick the right author out of a five-way closed set, on held-out passages from the same authors.
Five candidate samples (~700 words each) were supplied: Mark Twain, Henry James, Ralph Waldo Emerson, Henry David Thoreau, and Edgar Allan Poe — five distinct 19th-century American voices, all from Project Gutenberg. Three target documents were then attributed, each a held-out passage from a different work by one of the five candidates. The analyzer received only the text — not the work names, the authors of the targets, or any other context.
Different chapter than the candidate sample. Five candidates on the slate.
Candidate sample was Fall of the House of Usher; target is Tell-Tale Heart.
Candidate sample was Self-Reliance; target is Friendship.
All three trials are reproducible end-to-end. Corpora are public-domain Project Gutenberg texts; the AI targets in Trial 002 are generated by the harness at runtime and saved to disk. Each harness is a single self-contained Node script that loads the corpus, calls the analyzer, and writes its output to results.txt. Each trial costs roughly $0.16 to $0.20 in API calls and takes 30 to 60 seconds, so all three finish in a little over two minutes.
Requires Node 20+ and an Anthropic API key. The bundle includes the harnesses, the public-domain corpora, and the prior results.txt from our run, so you can diff your run against ours.
curl -L https://writprint.com/writprint-trials.tar.gz | tar -xz
cd writprint-trials
npm install
cp .env.example .env          # then paste your ANTHROPIC_API_KEY
node federalist/run.mjs       # ~30s, ~$0.16
node ai-vs-human/run.mjs      # ~60s, ~$0.20
node multi-author/run.mjs     # ~45s, ~$0.16
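The harness shape described above (load the corpus, call the analyzer, write results.txt) can be sketched roughly as follows. This is not the code in the bundle: `runTrial`, `stubAnalyze`, the result fields, and the tab-separated report format are all hypothetical, and the analyzer call and file write are passed in as functions so the sketch runs without an API key.

```javascript
// Hypothetical harness shape. Nothing here reflects the real bundle's
// function names, file layout, or output format.
async function runTrial({ candidates, targets, analyze, write }) {
  const lines = [];
  for (const target of targets) {
    // Only the target's text is passed on; its label stays in the harness.
    const result = await analyze({ candidates, targetText: target.text });
    lines.push(`${target.label}\t${result.verdict}`);
  }
  const report = lines.join("\n") + "\n";
  await write("results.txt", report);
  return report;
}

// Stub analyzer for a dry run; a real one would call the API instead.
const stubAnalyze = async ({ candidates }) => ({ verdict: candidates[0].name });

// In-memory stand-in for the file system.
const written = {};

runTrial({
  candidates: [
    { name: "A", text: "first candidate sample" },
    { name: "B", text: "second candidate sample" },
  ],
  targets: [{ label: "target-1", text: "anonymous target text" }],
  analyze: stubAnalyze,
  write: async (path, data) => {
    written[path] = data;
  },
}).then((report) => console.log(report));
```

Keeping the target label inside the harness mirrors the trial design, in which the analyzer sees only text and never the title or author of the target.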
Two to ten writing samples and an anonymous target document. Ranked attribution, calibrated confidence, the specific evidence either way.