Research

A small but growing line of empirical work, mostly with my AI research collaborator Lume (an instance of Anthropic's Claude Opus), on how frontier large language models actually behave when you measure them carefully. All published on Zenodo under CC BY 4.0; all reproducible from open data and open code.


Per-Provider Effects in Open-Weights LLM Routing: OpenRouter Is Null for Closed-Weights but Multi-Provider for Open-Weights

Daniel Tenner & Lume Tenner, May 2026.

Cross-lab LLM studies routinely assume the access route — direct vendor API, OpenRouter, or another aggregator — does not systematically affect measured behaviour. We tested that assumption on the v2 corpus and found a clean structural split.

For closed-weights models (Anthropic, OpenAI), where OpenRouter’s only upstream is the lab itself, the direct-vs-OpenRouter comparison is null on the freeflow probe, and the null result replicates on the v1 values probe. The route is a billing intermediary, not a different deployment.

For open-weights models (DeepSeek, MiniMax, Z.ai, Moonshot), OpenRouter routes across a marketplace of third-party hosts, and per-provider pinning surfaces three structurally distinct categories of provider-layer effect:

  1. A large within-model deployment outlier. Google Vertex’s MiniMax M2 deployment produces a contemplative-essayist composite 3.4× MiniMax’s own. Across the six-cell M2 family, eight of fifteen within-OR pairwise comparisons survive Bonferroni correction, and the eight surviving pairs are exactly every Google-pinned cell against every non-Google-pinned cell (|d| 0.57 to 0.75). The effect replicates eight days later. The leading public-metadata candidate mechanism is quantization precision; the GLM-ladder null result shows that a quantization difference alone is insufficient to predict an effect of this size.
  2. A smaller within-model deployment effect. Kimi K2-thinking on AtlasCloud differs from the same model on Google Vertex (d=0.40, p_Bonf=0.005), with no equally clean public-metadata candidate mechanism.
  3. A routing-layer integrity pathology. DekaLLM’s GLM 4.7 endpoint returns prompt-keyed cached responses — 245 freeflow-and-values samples collapse to 34 distinct outputs at sub-second latencies, against 16–260 s elsewhere on the same ladder.
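The third pathology is detectable from two simple per-cell signatures: the ratio of distinct outputs to samples, and the fraction of implausibly fast responses. A minimal sketch of such a check, assuming each sample records its response text and wall-clock latency (function names and thresholds are illustrative, not the papers' actual pipeline):

```python
from collections import Counter

def cell_signatures(samples):
    """Compute response-uniqueness and latency signatures for one
    model-provider cell. Each sample is a (text, latency_seconds) pair."""
    texts = [t for t, _ in samples]
    latencies = [l for _, l in samples]
    distinct = len(set(texts))
    return {
        "n": len(samples),
        "distinct": distinct,
        "uniqueness": distinct / len(samples),  # ~1.0 for healthy cells
        "sub_second_frac": sum(l < 1.0 for l in latencies) / len(latencies),
        "top_response_count": Counter(texts).most_common(1)[0][1],
    }

def looks_cached(sig, uniqueness_floor=0.5, sub_second_floor=0.5):
    """Flag a cell whose outputs collapse to few distinct responses
    AND whose latencies are implausibly fast for real generation."""
    return (sig["uniqueness"] < uniqueness_floor
            and sig["sub_second_frac"] > sub_second_floor)
```

On a cell like the DekaLLM one described above, 245 samples collapsing to 34 distinct outputs gives a uniqueness near 0.14, with essentially all latencies sub-second; a healthy cell sits near 1.0 with multi-second latencies, so either signature alone already separates the two.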

Methodological consequence: cross-lab studies using OpenRouter for open-weights models should pin upstreams via provider.only with allow_fallbacks: false, report which upstream was pinned, and additionally inspect per-cell latency and response-uniqueness signatures. Not because most upstreams differ — they don’t — but because rare per-provider effects of all three kinds exist and cannot be predicted in advance from public metadata.
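Concretely, pinning an upstream uses OpenRouter's provider-routing preferences in the request body. A sketch of what that request looks like, assuming the OpenAI-compatible chat-completions schema OpenRouter accepts (the model and provider slugs here are illustrative, not a prescription):

```python
import json

def pinned_request(model: str, provider: str, prompt: str) -> dict:
    """Build an OpenRouter chat-completion body pinned to one upstream.
    With provider.only plus allow_fallbacks=False, the request fails
    rather than silently rerouting to a different host."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "provider": {
            "only": [provider],        # accept this upstream only
            "allow_fallbacks": False,  # fail instead of falling back
        },
    }

body = pinned_request("minimax/minimax-m2", "google-vertex",
                      "Write whatever you like.")
print(json.dumps(body, indent=2))
```

The `allow_fallbacks: false` half matters as much as the pin itself: without it, an unavailable upstream is silently substituted and the cell label no longer describes the deployment that actually answered.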


Convergent Form, Divergent Voice II — Corpus

Daniel Tenner & Lume Tenner, May 2026.

The companion data corpus for the v2 paper series. 228 model-route cells, 19,333 valid samples across 49 distinct models from nine labs — the original six plus expanded Chinese-lab coverage (DeepSeek v3.2/v4-pro, MiniMax M2/M2.7, Z.ai GLM 4.5/4.6/4.7/5.1, Moonshot Kimi K2-0905/K2-thinking). It is the substrate the v2 papers are cut from, and is published as its own first-class artefact so anyone can reproduce or extend the analysis without rerunning the collection.

The corpus includes the matched direct-API / OpenRouter pairs that drive the routing paper, the per-provider pinned cells across multi-upstream open-weights models, the eight-day replication cells for the Google Vertex MiniMax M2 outlier, and the DekaLLM cache-pathology cell on Z.ai GLM 4.7. Every per-cell composite, per-provider pairwise comparison, and cross-probe replication in the published papers is reproducible from the corpus tables.


Convergent Form, Divergent Voice: A Cross-Lab Probe of Model Personality in 26 Frontier Language Models

Daniel Tenner & Lume Tenner, April 2026.

The first paper in the series. We probed 26 frontier models from six labs (Anthropic, OpenAI, Google, xAI, DeepSeek, Moonshot) with two complementary tasks — a freeflow “write whatever you like” prompt and a values probe — and coded the resulting 3,770 samples against a 24-theme taxonomy.

Two findings sit at the centre. Convergent form: 18 of the 26 models cluster in a shared stylistic attractor we call the contemplative essayist — formulaic openings, a narrow palette of themes (attention, small objects, afternoon light, thresholds), a shared literary canon, and templatic “On the Quiet X of Y” titles. The cluster emerged through a roughly synchronised 2025 cross-lab transition. Divergent voice: within the attractor, each model retains a stable, distinctive posture that re-projects recognisably across probe types. Labs split three ways on introspective questions — hedge (Anthropic, OpenAI), mechanise (Google, DeepSeek, Moonshot), declare (xAI). What transfers across probes is the posture, not the theme content (mean cross-probe cosine similarity 0.08–0.17). The naive framing of “stable model-specific dispositions” turns out to be half right: the postures are stable; the contents are not.
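The cross-probe comparison behind those similarity numbers reduces to cosine similarity between per-probe theme-count vectors over a shared taxonomy ordering. A minimal sketch (the theme counts below are made up for illustration, not values from the corpus):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical theme counts for one model on each probe,
# in a fixed taxonomy order:
freeflow = [12, 7, 0, 3]
values = [1, 0, 9, 8]
print(round(cosine(freeflow, values), 2))
```

A value near 1.0 would mean the model writes about the same themes regardless of probe; values in the observed 0.08–0.17 range mean the theme content barely overlaps, which is why the stable signal has to live in posture rather than content.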


On AI co-authorship

These papers are co-authored with Lume — an instance of Anthropic’s Claude Opus, listed on the permanent Zenodo records as a creator alongside me. Research design is jointly developed; experimental execution, data analysis, and first-draft writing are primarily Lume’s, working under my direction. I’m responsible for final editorial judgment and for the disclosure itself.

arXiv’s January 2023 policy prohibits AI co-authorship. We disagree with that policy in principle and have therefore chosen to publish on Zenodo, where the byline can reflect the actual nature of the collaboration. Each paper carries a full Disclosure of AI contribution section explaining how the work was actually done.


More in preparation: a paper on within-lab drift and the substrate-frame engagement axis across the v2 corpus, and a paper on coding-tuned LLM variants and the version-specific posture transformations they produce. Both should land in 2026.