Truth Is Losing by Attrition
The cost of producing a lie has reached zero. The cost of disproving it hasn't moved.

On September 25, 2025, the Government Accountability Office dismissed bid protests filed by Oready, LLC and called the conduct an abuse of process. The filings were formatted correctly. The Bluebook citations were impeccable. The cases did not exist. Every one of them was invented — names, holdings, parentheticals — by a machine that had learned the shape of legal authority without learning what law is.
This is not a story about one company's misconduct. It is a story about a threshold being crossed.
Oready was not the first. In May 2025, Raven admitted its erroneous citations were AI-generated. In July, BioneX submitted filings that adjudicators recognized as hallucinated case law. In August, IBS Government Services filed briefs containing fabricated or misquoted GAO decisions. Four cases. Five months. One federal system. The same pattern each time: confident prose, phantom citations, plausible format, zero substance. By September, calling this a trend was being charitable.
This was the new operating condition.
Ask yourself what it means that we cannot yet feel the weight of that sentence. We have absorbed the vocabulary of AI risk so thoroughly — "hallucinations," "misinformation," "safety concerns" — that we can speak about the collapse of epistemic infrastructure and still feel, somewhere underneath the concern, that someone competent is handling it. The purpose of what follows is to make that comfort unavailable.
The Ratio That Decides Everything
In 2013, an Italian programmer named Alberto Brandolini posted an observation online: the amount of energy needed to refute bullshit is an order of magnitude larger than the energy needed to produce it. He was describing Twitter arguments. He was describing the defining structural crisis of the next decade.
The ratio he identified is not a metaphor. It is measurable.
A disinformation video — script, voiceover, stock footage, upload — takes fifteen minutes to produce. The refutation requires subject matter experts, transcribed interviews, fact-checked claims, and a response legible to a general audience. Three days minimum. The ratio is 1:288. A fabricated medical abstract — one claiming a new drug reduces mortality by 40 percent — takes thirty seconds to write. Verifying it requires raw data, rerun statistical analyses, clinical consultation, and a check for p-hacking. Three to five days. The ratio exceeds 1:10,000.
These are measurements. From legal systems, from academic publishing, from financial auditing. Not estimates.
Large language models do not optimize for truth. They optimize for the statistical probability of the next token. They hallucinate — producing likely but false statements — at rates near 30 percent in complex domains like procurement law. This is not a bug. It is a structural property of systems that learn the shape of language without learning the structure of the world. The marginal cost of generating persuasive, high-fluency nonsense has approached zero. The marginal cost of verification remains tethered to human cognitive processing, institutional review, and the stubborn limits of time.
There is a name for what happens when unchallenged fabrication accumulates perceived authority by simply escaping refutation. Researchers call it the implied truth effect. By the time you dismantle one fabricated claim, it has spread, forked, mutated, and landed emotionally. The correction arrives to an audience that has already moved. The fabrication remains. This is not a communications problem.
This is a denial-of-service attack on truth itself.
What the Costume of Expertise Actually Costs
The most dangerous fabrication is not the obvious lie. It is the claim dressed in the appearance of rigor.
You know what it looks like. LaTeX equations. Regression tables with t-statistics and confidence intervals. A p-value at 0.049 — just below the threshold that divides publishable from rejected. Error bars. Footnotes citing papers with plausible-sounding titles that do not exist. The format performs the work of legitimacy. You are reading quickly. The cognitive load is already high. You move on.
In medicine: synthetic manuscripts now achieve similarity scores between 14 and 26 percent on plagiarism detection software — well below the thresholds that trigger automatic rejection — while being entirely fabricated in their findings. These papers pass the formatting check. They survive the similarity scan. They enter the literature. Other AI systems cite them. The loop closes.
In finance: AI-driven trading strategies report impressive returns until you ask whether the model is predicting the past using information from the future. Financial language models trained on web-scale datasets include post-hoc explanations of market events. When such a model forecasts a stock's performance during a historical period it was trained on, it is not reasoning. It is reciting. The alpha is not skill. It is temporal contamination — a form of cheating so subtle it passes every standard audit and reveals itself only when the model encounters genuinely unknown data. Which is to say: the moment it is deployed in the real world.
In law: the fabricated GAO decisions in Oready were formatted perfectly. Proper citations. Bluebook-compliant parentheticals. Confident holdings. The cases did not exist. But the format was indistinguishable from legitimate legal work.
This is authority laundering. Automated output is accepted as expert finding because it wears the costume of expertise. We are pattern-matching machines, and the pattern of rigor is easier to recognize than the absence of substance beneath it. We were built for a world where the cues of expertise were earned. We are living in a world where they can be generated for pennies.
You cannot fact-check your way out of this. The volume is too high. The fluency is too good. The veneer is too convincing.
Structural Problems Require Structural Solutions
Here is the thing that resists easy comfort: we cannot think our way out of this problem with better education alone. We cannot train enough critical thinkers fast enough. The asymmetry is structural. The only way to address a structural problem is structurally.
If bullshit production has been automated, skepticism must also be automated.
The intellectual tradition behind this claim runs through Karl Popper, who argued that knowledge advances by disproving claims rather than defending them. Through David Hume, who showed that induction is fragile — no amount of confirming observations proves a universal claim. Through Harry Frankfurt, who distinguished the liar from the bullshitter: the liar knows what truth is and inverts it deliberately; the bullshitter does not care about truth at all, only about the effect of the speech. Through Daniel Kahneman, who showed our cognitive biases make us predictably wrong in specific ways. Through Richard Feynman, who said the first principle is that you must not fool yourself — and you are the easiest person to fool.
The rule inherited from all of them: a claim that cannot be stress-tested is not knowledge. It is marketing.
Computational skepticism applies this rule at machine speed.
What the Checks Look Like in Practice
Start with baselines. A result is only meaningful if it outperforms a null model — a representation of the system where the hypothesized effect is absent. When a financial AI claims 65 percent accuracy in predicting stock returns, ask: what does a random walk predict? What does a simple moving average predict? If the complex model achieves 65 percent and the moving average achieves 63 percent, the additional complexity bought two percentage points. That is noise with better marketing.
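To make the baseline check concrete, here is a minimal sketch in Python. It assumes you already have the complex model's directional calls, the realized returns, and a price series; every name is illustrative, not any particular library's API.

```python
# A minimal sketch of a baseline check, assuming the caller supplies each
# model's directional predictions and the realized returns as arrays.
import numpy as np

def hit_rate(predicted_direction, realized_returns):
    """Fraction of periods where the predicted sign matches the realized sign."""
    return float(np.mean(np.sign(predicted_direction) == np.sign(realized_returns)))

def baseline_gap(complex_preds, realized_returns, prices, window=20):
    # Null model 1: a coin flip (a random walk offers no exploitable direction).
    rng = np.random.default_rng(0)
    coin = rng.choice([-1.0, 1.0], size=len(realized_returns))

    # Null model 2: follow the sign of a simple moving-average trend.
    ma = np.convolve(prices, np.ones(window) / window, mode="same")
    trend = np.sign(prices - ma)[:len(realized_returns)]

    scores = {
        "complex_model": hit_rate(complex_preds, realized_returns),
        "coin_flip": hit_rate(coin, realized_returns),
        "moving_average": hit_rate(trend, realized_returns),
    }
    # The headline number only matters relative to the best null model.
    scores["edge_over_best_null"] = scores["complex_model"] - max(
        scores["coin_flip"], scores["moving_average"]
    )
    return scores
```

The only number worth reporting is the last one: the edge over the best null model, not the accuracy in isolation.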
Then look for statistical fingerprints. Real-world data possesses structural properties that synthetic systems struggle to replicate. Natural datasets — images, text, tabular records — exhibit characteristic power-law decay of eigenvalues in their covariance matrices. This signature holds across modalities: financial time series, genomic sequences, sensor readings. Synthetic data, while mimicking basic statistics like mean and variance, fails to replicate these higher-order spectral properties. The eigenvalues drop too sharply or flatten unnaturally. The correlation structure is artificially smoothed. Fabricated records cluster in dense regions of the data manifold — approximating the original training data in ways that reveal the model memorized rather than learned. Tools from Random Matrix Theory make this visible. The fingerprint of fabrication is structural, not rhetorical.
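A rough version of the spectral check can be written in a few lines. The sketch below assumes each dataset arrives as a plain samples-by-features matrix and fits a crude power-law slope to the eigenvalue decay; it is a screening heuristic, not the full Random Matrix Theory treatment.

```python
# A minimal sketch of the spectral fingerprint check. The slope fit is
# deliberately crude; it flags candidates for closer inspection, nothing more.
import numpy as np

def eigenvalue_spectrum(X):
    """Sorted eigenvalues of the feature covariance matrix (descending)."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    return eigvals[eigvals > 1e-12]          # drop numerical zeros

def spectral_decay_slope(X):
    """Slope of log-eigenvalue vs log-rank. Natural data tends toward a steady
    power-law decay; synthetic data often drops too sharply or flattens."""
    eigvals = eigenvalue_spectrum(X)
    ranks = np.arange(1, len(eigvals) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(eigvals), 1)
    return slope

def compare_spectra(reference_data, suspect_data):
    return {
        "reference_slope": spectral_decay_slope(reference_data),
        "suspect_slope": spectral_decay_slope(suspect_data),
    }
```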
Then test for fragility. A robust claim remains stable under slight changes to its inputs. Bullshit is fragile. Rephrase the question slightly. Shift a parameter by five percent. If the conclusion shifts wildly, the initial claim is not robust — it is a hallucination that survived because no one pushed it. In Retrieval-Augmented Reasoning systems, this approach is called R2C: the same question is asked multiple times with slight variations in phrasing or intermediate reasoning steps. If answers converge, the system has genuine confidence. If they diverge, it is guessing. High divergence across perturbed paths reveals a claim sitting on a knife's edge.
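The perturbation test is simple enough to sketch directly. The code below is not R2C itself; it assumes a caller-supplied `ask` function standing in for whatever model or pipeline is being probed, and it measures nothing more sophisticated than agreement across paraphrases.

```python
# A minimal sketch of a perturbation check in the spirit described above.
# The paraphrases, the agreement metric, and the 0.7 threshold are illustrative.
from collections import Counter
from typing import Callable, List

def consistency_under_rephrasing(ask: Callable[[str], str],
                                 phrasings: List[str]) -> dict:
    answers = [ask(p).strip().lower() for p in phrasings]
    counts = Counter(answers)
    modal_answer, modal_count = counts.most_common(1)[0]
    agreement = modal_count / len(answers)
    return {
        "answers": answers,
        "modal_answer": modal_answer,
        "agreement": agreement,      # near 1.0: stable; near 1/n: guessing
        "flag": agreement < 0.7,     # arbitrary review threshold
    }

# Usage: pass several paraphrases of the same factual question and treat
# low agreement as a signal that the claim sits on a knife's edge.
```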
The p-curve catches the cheaters in medicine. In a research ecosystem without manipulation, the distribution of p-values across a set of studies is right-skewed when a genuine effect exists. You get many very small values: 0.001, 0.003, 0.008. What you do not get is a spike just below 0.05. When you see that spike — clustering just below the threshold — you are looking at the fingerprint of researchers who ran twenty analyses and published the one that crossed the line. Generative AI has amplified this by enabling hundreds of analyses in seconds. Automate p-curve analysis across an entire research domain. A spike below 0.05 is a structural red flag. Plot it. The weak claims collapse before they reach the press release.
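Automating the screen is straightforward once the p-values have been extracted. The sketch below uses a single bin just under the threshold and a binomial test; the published p-curve method is more careful, so treat this as a first-pass flag, not a verdict.

```python
# A minimal sketch of a p-curve style screen, assuming the reported p-values
# from a set of studies have already been extracted.
from scipy.stats import binomtest

def pcurve_spike_check(p_values, alpha=0.05, bin_lo=0.04):
    significant = [p for p in p_values if p < alpha]
    if not significant:
        return {"n_significant": 0, "verdict": "nothing to test"}
    near_threshold = [p for p in significant if p >= bin_lo]
    # Under uniform p-values (no effect, no selection), the 0.04 to 0.05 bin
    # holds 20% of significant results; a genuine effect puts even less there.
    expected_share = (alpha - bin_lo) / alpha
    test = binomtest(len(near_threshold), len(significant),
                     expected_share, alternative="greater")
    return {
        "n_significant": len(significant),
        "n_near_threshold": len(near_threshold),
        "observed_share": len(near_threshold) / len(significant),
        "p_value_of_spike": test.pvalue,   # small value: the spike is real
    }
```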
The Look-Ahead-Bench benchmark catches the cheaters in finance. It measures the performance drop when a model moves from periods it was trained on to genuinely unknown territory. Llama 3.1 70B shows 12.4 percent alpha during the training window and negative 3.2 percent out-of-sample — a decay of 15.6 percentage points. The pattern is inverse scaling. The larger the model, the stronger the in-sample memorization, and the more catastrophic the collapse when the crutch is removed. The model is not smarter. It has cheated more efficiently.
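The decay measurement itself is trivial to compute once you know the model's training-data cutoff. The sketch below is not Look-Ahead-Bench; it only splits a dated return series at the cutoff and compares average excess returns on each side.

```python
# A minimal sketch of the in-sample vs out-of-sample decay measurement,
# assuming dated strategy and benchmark returns plus a known training cutoff.
import pandas as pd

def alpha_decay(strategy_returns: pd.Series, benchmark_returns: pd.Series,
                training_cutoff: str) -> dict:
    """Both series are indexed by date; the cutoff is the last training date."""
    excess = strategy_returns - benchmark_returns
    cutoff = pd.Timestamp(training_cutoff)
    in_sample = excess[excess.index <= cutoff]
    out_of_sample = excess[excess.index > cutoff]
    in_alpha = float(in_sample.mean()) * 252    # crude annualization
    out_alpha = float(out_of_sample.mean()) * 252
    return {
        "in_sample_alpha": in_alpha,
        "out_of_sample_alpha": out_alpha,
        "decay": in_alpha - out_alpha,          # large drop suggests memorization
    }
```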
Semantic forensics catches the cheaters in media. Not by asking whether an image is synthetic — that question is increasingly unanswerable at the pixel level — but by asking whether multi-modal assets exhibit coherent physical reality. Light sources that violate physics. Reflections that show impossible scenes. Shadow angles that contradict the metadata timestamp. The DARPA Semantic Forensics program found that revealing the underlying intent — "this image was generated to support a specific narrative" — is more effective than a simple "AI-generated" label in helping audiences form accurate interpretations. Identifying synthetic origin is not enough. The intent must be named.
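The idea can be sketched as a coherence check over signals extracted upstream. None of the field names or thresholds below come from the DARPA program; they are hypothetical placeholders for the shape of the test: independent physical signals must agree, or the asset gets flagged.

```python
# A minimal, hypothetical sketch of a cross-modal coherence check. Every field
# name and threshold is illustrative; the signals would come from EXIF parsing
# and vision tooling upstream of this function.
def coherence_flags(asset: dict) -> list:
    """asset holds upstream-extracted signals: claimed_capture_hour (0-23),
    estimated_scene_brightness (0-1), claimed_light_direction and
    estimated_shadow_direction (compass degrees)."""
    flags = []

    hour = asset.get("claimed_capture_hour")
    brightness = asset.get("estimated_scene_brightness")
    if hour is not None and brightness is not None:
        if (hour < 5 or hour > 21) and brightness > 0.7:
            flags.append("metadata says night, scene reads as daylight")

    light = asset.get("claimed_light_direction")
    shadow = asset.get("estimated_shadow_direction")
    if light is not None and shadow is not None:
        # Shadows should fall roughly opposite the claimed light source.
        expected_shadow = (light + 180) % 360
        deviation = abs(((expected_shadow - shadow + 180) % 360) - 180)
        if deviation > 45:
            flags.append("shadow direction contradicts claimed light source")

    return flags
```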
Automated citation verification catches the cheaters in law. Real-time cross-referencing against legal databases flags non-existent citations before filings are submitted. The phantom citations are becoming visible. Not because we became more virtuous. Because we built the infrastructure to catch them.
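A pre-filing screen of this kind needs little more than a citation pattern and an authority lookup. In the sketch below, the regular expression loosely targets GAO-style B-numbers and `exists_in_database` is a stand-in for whatever authority database the filer actually queries; it is illustrative, not a tested compliance tool.

```python
# A minimal sketch of a pre-filing citation screen. The pattern and the lookup
# are placeholders; a real screen would cover every citation format in the brief.
import re
from typing import Callable, List

GAO_CITATION = re.compile(r"B-\d{6}(?:\.\d+)?")

def phantom_citations(brief_text: str,
                      exists_in_database: Callable[[str], bool]) -> List[str]:
    cited = sorted(set(GAO_CITATION.findall(brief_text)))
    return [c for c in cited if not exists_in_database(c)]

# Usage: run before filing; any citation returned here has no match in the
# authority database and must be traced to a primary source by a human.
```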
The Framework That Applies Doubt at Scale
The Popper Framework — named for Karl Popper — automates the falsification of free-form hypotheses. Two agents: an Experiment Design Agent that identifies measurable implications of a hypothesis and designs falsification experiments; an Experiment Execution Agent that implements those experiments through code, simulation, or data analysis. The system converts individual p-values into e-values, a statistical measure of evidence that can be aggregated across multiple dependent tests while controlling false positive rates. This enables anytime-valid sequential testing: the system can decide at any point whether to reject a hypothesis, accept it provisionally, or gather more evidence, without inflating error rates.
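The e-value bookkeeping can be sketched compactly. The code below is not the Popper Framework itself; it uses one standard p-to-e calibrator and assumes each experiment's e-value is computed in sequence, conditional on the ones before it, which is what makes stopping at any time legitimate.

```python
# A minimal sketch of anytime-valid evidence accumulation with e-values.
# The calibrator kappa * p**(kappa - 1) is one standard p-to-e conversion;
# the framework's own conversion may differ.
class SequentialEValueTest:
    """Accumulate evidence against a hypothesis and stop at any time."""

    def __init__(self, alpha: float = 0.05, kappa: float = 0.5):
        self.alpha = alpha
        self.kappa = kappa
        self.running_e = 1.0

    def add_p_value(self, p: float) -> str:
        e = self.kappa * p ** (self.kappa - 1)   # calibrated e-value
        self.running_e *= e                      # running product of evidence
        if self.running_e >= 1.0 / self.alpha:
            return "reject"                      # threshold crossed, act now
        return "keep testing"                    # gather more evidence

# Usage: feed in the p-value from each falsification experiment as it finishes;
# a "reject" can be acted on immediately without inflating the error rate.
```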
Compared to human scientists working through the same validation process, the Popper Framework achieves comparable performance in one-tenth the time.
The framework operationalizes three principles: evaluate every assertion by balancing supporting and contradicting evidence; actively attempt to disprove assumptions rather than confirm them; verify that relationships are causal, not coincidental. It does not outsource judgment. It lowers the cost of verification so judgment can be applied where it matters.
What You Are Now Responsible For
We are building a new form of literacy whether we acknowledge it or not. The honest name for it is professional accountability in conditions of automated production.
When you use AI to assist your work, you are responsible for verifying its outputs. Full stop. "The AI made a mistake" is not a defense — not before a judge, not before an editor, not before anyone who trusted you with their attention. The veneer of rigor generated by a machine does not transfer the ethical weight of rigor to you. You must earn that weight through verification. The professional standard emerging from Oready and its predecessors is not novel. It is the oldest standard in every professional field, restated for a new condition: you signed it, you own it.
AI systems excel at pattern matching, retrieval, and generation at scale. Humans remain indispensable for normative judgment, contextual interpretation, and semantic coherence — for asking not just whether a claim is plausible, but what it means and who it serves. The error is not using AI. The error is treating AI output as the end of the epistemic process rather than its beginning.
Large language models do not know facts. They predict token sequences. A system that produces likely outputs is useful if you treat its outputs as hypotheses to be tested. It is dangerous if you treat them as conclusions.
We failed to build these habits when the tools first arrived. We adopted them quickly and critically examined them slowly. The procurement filings, the medical abstracts, the financial backtests — these are the cost of that sequence.
The Asymmetry Is Not Closing on Its Own
The institutions are adapting. The GAO dismisses fabricated filings. Judges impose sanctions. Medical journals implement pre-registration. Financial regulators mandate out-of-sample testing. None of this happened because of a moral awakening. It happened because the consequences of not adapting became visible and traceable.
Moral clarity is not what moves institutions. Structural failure is.
And the structural failure is still accelerating faster than the institutional response.
The cost of production has reached near-zero. The cost of verification has not moved. This is the asymmetry. It is not ideological. It is mathematical. And it will not close because we feel concerned about it.
Unless skepticism is automated — unless we build structural friction into information systems at the same scale as generation — truth loses by attrition. Not through any dramatic confrontation. Through accumulation. One fabricated citation at a time. One synthetic abstract at a time. One perfectly formatted, confidently written, factually empty document at a time.
The tools to close the gap exist. The frameworks are built. The professional standards are emerging. What remains is the decision to treat verification not as an optional courtesy but as the minimum obligation of anyone who wants their claims to mean something.
A claim that cannot be subjected to a cheap, fast, structural check is not knowledge.
It is marketing.
And we have been buying it.
SUMMARY
This piece uses the GAO's dismissal of Oready LLC's fabricated bid protests as the entry point into an argument about a specific, measurable, and widening structural failure: the decoupling of production cost from verification cost. The claim is precise. The institutions built to distinguish true from false — the peer-reviewed journal, the federal tribunal, the regulatory body — were designed for a world where generating a persuasive claim was expensive enough to filter out low-quality submissions. That assumption is dead. The piece names what killed it and what it means.
The argument refuses the vocabulary of generalized AI risk because that vocabulary allows concern without accountability. Four cases in five months in one federal system are not outliers. They are the new operating condition. Naming them specifically — Oready, Raven, BioneX, IBS — is not rhetorical. It is the difference between describing a pattern and describing a reality. The pattern has consequences attached to specific decisions made by specific people.
The technical sections — baselines, statistical fingerprints, fragility testing, p-curve analysis, Look-Ahead-Bench, semantic forensics, citation verification — are not presented as a solution to the problem. They are presented as proof that the problem is structurally addressable, which makes the gap between what is technically possible and what is institutionally implemented a question of decision rather than capacity. The reader is left holding that gap and asked to locate themselves in it.
The piece closes where the argument demands: not with hope, not with despair, but with the arithmetic of what happens when we don't act. The final moral weight lands on the individual professional. "The AI made a mistake" is not a defense. The question is what you intended to do with its output. That question is now everyone's professional obligation — and most people are not yet treating it as one.