Declared First · Part 5 of 5

What I-O Psychology knows about human data that ML doesn't

I-O has spent a hundred years asking one question with unusual rigor: when a human looks at another human and writes something down, what just happened? That question is now the bottleneck of frontier AI.

May 6, 2026 · journal · essay · i-o psychology

I trained as an Industrial-Organizational and Engineering Psychologist before I trained as a product manager. The discipline is not famous. It should be. I-O has spent a hundred years asking one question with unusual rigor: when a human looks at another human and writes something down, what just happened?

That question is now the bottleneck of frontier AI. Every reward model, every preference dataset, every red-team rubric is built on humans looking at outputs and writing things down. The field doing the looking has mostly never heard of the field that figured out how to look.

This is the gap. I want to name it precisely, then show what closing it looks like.

Five things I-O knows

Plain language. No jargon I do not unpack.

Construct validity. Before you measure a thing, you have to know what the thing is. "Helpfulness" is not a construct. It is a vibe wearing a lab coat. A construct is a defined behavior with boundaries: what counts, what doesn't, what looks similar but isn't. I-O writes these definitions before it writes the rubric. ML usually writes the rubric first and discovers the construct in the disagreement.

Inter-rater reliability. When two raters disagree, there are two possibilities: the item is hard, or one of the raters is sloppy. These are different problems with different fixes. I-O has math that tells them apart. Cohen's kappa, Krippendorff's alpha, generalizability theory. ML annotation pipelines mostly compute simple percent agreement and shrug.
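Here is that gap in code: a minimal Python sketch with invented labels. Two raters agree on 85 percent of items in a heavily imbalanced dataset. Percent agreement says the pipeline is healthy. Cohen's kappa, which corrects for the agreement you would get by chance alone, says the agreement is worth nothing.

```python
from collections import Counter

def percent_agreement(a, b):
    """Raw agreement: the fraction of items where two raters match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    # Expected agreement if both raters labeled at random with
    # their own observed label frequencies.
    p_e = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Invented labels: a heavily imbalanced binary task.
rater_1 = ["helpful"] * 90 + ["unhelpful"] * 10
rater_2 = ["helpful"] * 85 + ["unhelpful"] * 5 + ["helpful"] * 10

print(percent_agreement(rater_1, rater_2))  # 0.85, looks healthy
print(cohens_kappa(rater_1, rater_2))       # about -0.07, chance level
```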

Range restriction. If your raters all come from the same population, your data tells you about that population, not about humans. The Stanford-undergrad problem is the famous version. The contractor-marketplace problem is the current version. The fix is not more raters. The fix is a sampling frame.
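A sampling frame is a declared target, and the gap between frame and pool is a number you can compute and act on. A minimal sketch, with invented strata and counts:

```python
from collections import Counter

# Declared frame: the population you want to generalize to.
# Strata and target shares here are invented for illustration.
target = {"age_18_29": 0.25, "age_30_49": 0.40, "age_50_plus": 0.35}

# The pool you actually have: a contractor marketplace, skewed young.
pool = ["age_18_29"] * 70 + ["age_30_49"] * 25 + ["age_50_plus"] * 5

observed = Counter(pool)
for stratum, share in target.items():
    got = observed[stratum] / len(pool)
    print(f"{stratum}: target {share:.0%}, pool {got:.0%}, gap {share - got:+.0%}")
```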

Item response theory. Not every prompt is equally informative. Some discriminate between strong and weak models. Some are too easy and tell you nothing. I-O has fifty years of methodology on calibrating the instrument, not just the respondent. ML evals mostly calibrate the model and treat the benchmark as fixed.
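The core of that machinery, sketched with invented parameters: the two-parameter logistic (2PL) model gives every item a difficulty b and a discrimination a, and the Fisher information says how much the item can tell you about a respondent at a given ability.

```python
import math

def p_correct(theta, a, b):
    """2PL model: probability that a respondent of ability theta
    answers correctly, given discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information: how much the item tells you about
    respondents near ability theta. It peaks where theta == b."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# A well-targeted item vs. a too-easy one, at a high ability level.
for a, b in [(2.0, 0.0), (0.3, -3.0)]:
    print(f"a={a}, b={b}: P={p_correct(1.0, a, b):.2f}, "
          f"info={item_information(1.0, a, b):.3f}")
# The too-easy item yields roughly 25x less information at theta = 1.0.
```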

Adverse impact. When a measurement system produces systematically different outcomes for different groups, that is a measurable property of the system, not a vibe. I-O developed the math because the courts demanded it. ML is rediscovering the same math under the name "fairness," usually without the case law.
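The canonical version is the four-fifths rule from the 1978 Uniform Guidelines: flag any group whose selection rate falls below 80 percent of the highest group's rate. A sketch with invented pass counts follows; the threshold is a legal heuristic, not a statistical test, and I-O pairs it with significance testing in practice.

```python
def impact_ratios(selected, applicants):
    """Selection rate of each group relative to the highest-rate group.
    The four-fifths rule flags ratios below 0.8."""
    rates = {g: selected[g] / applicants[g] for g in applicants}
    top = max(rates.values())
    return {g: rate / top for g, rate in rates.items()}

# Invented pass counts for a rater-qualification screen.
applicants = {"group_a": 200, "group_b": 200}
selected = {"group_a": 120, "group_b": 80}

for group, ratio in impact_ratios(selected, applicants).items():
    flag = "adverse impact" if ratio < 0.8 else "ok"
    print(f"{group}: impact ratio {ratio:.2f} -> {flag}")
```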

Why ML keeps tripping on this

Reward hacking. Brittle evals. Silent disagreement between annotators that surfaces six months later as a model behavior nobody can explain.

These are not new problems. They are I-O problems with new names.

The pattern is consistent. ML reaches a scale where the human-data layer becomes the bottleneck, then reinvents a worse version of the methodology that was already there. The reinventions are often clever. They are also slow, and they cost real safety. When Claude evaluates Claude, the field is rediscovering inter-rater reliability with a model in one of the rater seats. Useful work. Worth doing. But the discipline already has a hundred years of theory on what happens when raters drift, when the scale itself is the problem, when the construct was never defined. Importing that theory is cheaper than re-deriving it.

Three moves I-O brings to RLHF

Concrete, not theoretical. Each one is a thing I have shipped.

One. Define the construct before you write the instruction.

At Amazon, I worked on a daily employee-feedback system that went out to a population larger than most cities. Before any single question reached a single person, the construct had to survive a definitional review: what behavior, in what context, observable by whom, distinguishable from what. The instruction was the last artifact in the chain, not the first. Every word in it was downstream of a definition we could defend.

Apply this to RLHF. The rubric is not a substitute for the construct. It is a translation of the construct into rater-facing language. If you cannot define "helpful" without using the word "helpful," your rubric is a mirror, not an instrument. Your raters are not labeling. They are projecting. The model learns the projection.
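Here is what a declaration can look like as an artifact. The definition below is mine, invented for illustration, not a standard. The point is the shape: behavior, boundaries, contrasts, all written down before any rater sees a rubric.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Construct:
    name: str
    behavior: str           # what counts, stated as observable behavior
    boundaries: list[str]   # what does not count
    contrasts: list[str]    # looks similar, is a different construct

helpfulness = Construct(
    name="helpfulness",
    behavior="Resolves the user's stated task, or says plainly why it "
             "cannot, without needing a follow-up to become actionable.",
    boundaries=["being agreeable", "being long", "sounding confident"],
    contrasts=["harmlessness (refusing well)", "honesty (being calibrated)"],
)
# The rater-facing rubric is generated from this object.
# It is the last artifact in the chain, not the first.
```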

Two. Use agreement statistics that distinguish hard items from sloppy raters.

At Instill AI, we built culture analytics for organizations. The whole product was a measurement system pointed at human judgment. We computed agreement at the item level and at the rater level, separately, every time. An item with low agreement told us the question was hard or the construct was poorly defined. A rater with low agreement against everyone else told us the rater was the problem. Different fix.

ML annotation usually computes agreement once, at the dataset level, and uses it as a quality gate. That is like averaging the temperature of every patient in a hospital and calling it the health of the building. The information you need lives in the variance, not the mean. Track item-level kappa. Track rater-level kappa. The hard items are where your model will surprise you. The sloppy raters are where your model will quietly learn the wrong thing.
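A sketch of the decomposition, on invented labels from three raters. Raw agreement stands in for kappa to keep the code short; in practice, chance-correct both views. Notice that the item view and the rater view localize the problem differently, and that difference is exactly what the dataset-level number throws away.

```python
from itertools import combinations

# Invented labels: labels[item][rater] -> judgment.
labels = {
    "item_1": {"r1": "pass", "r2": "pass", "r3": "pass"},
    "item_2": {"r1": "pass", "r2": "pass", "r3": "fail"},
    "item_3": {"r1": "fail", "r2": "fail", "r3": "pass"},
    "item_4": {"r1": "pass", "r2": "fail", "r3": "fail"},
}

def item_agreement(ratings):
    """Fraction of rater pairs agreeing on one item.
    Low values point at the item, or at the construct."""
    pairs = list(combinations(ratings.values(), 2))
    return sum(a == b for a, b in pairs) / len(pairs)

def rater_agreement(rater):
    """This rater's average pairwise agreement with every other
    rater, across all items. Low values point at the rater."""
    hits = total = 0
    for ratings in labels.values():
        for other, label in ratings.items():
            if other != rater:
                hits += ratings[rater] == label
                total += 1
    return hits / total

for item, ratings in labels.items():
    print(item, round(item_agreement(ratings), 2))  # items 2-4 all look hard
for rater in ["r1", "r2", "r3"]:
    print(rater, round(rater_agreement(rater), 2))  # r3 is the outlier
```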

Three. Calibrate the scale, not just the rater.

At Procore, in the early days of construction tech, I learned that the data import tool is the product. If the scale you hand the user is broken, no amount of training fixes the data. We spent more time on the instrument than on the people using it.

ML rarely does this. The benchmark is treated as fixed and the model is treated as the variable. But benchmarks rot. Items that discriminated last year saturate this year. Items written for one model class fail to discriminate for another. Item response theory has the math: each item has a difficulty parameter and a discrimination parameter, and both can be estimated empirically and updated as the population of models shifts. Eval suites should be living instruments, not stone tablets.
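A sketch of the living-instrument loop. It uses a classical-test-theory shortcut rather than a full 2PL fit: pass rate as difficulty, point-biserial correlation against an ability score as discrimination. All numbers are invented. Rerun it whenever the model population shifts, and retire what no longer discriminates.

```python
from statistics import mean, pstdev

# Invented eval results: results[item][model] = 1 if the model passed.
results = {
    "item_a": {"m1": 1, "m2": 1, "m3": 1, "m4": 1, "m5": 1},  # saturated
    "item_b": {"m1": 0, "m2": 0, "m3": 1, "m4": 1, "m5": 1},  # discriminates
    "item_c": {"m1": 1, "m2": 0, "m3": 1, "m4": 0, "m5": 1},  # noise
}
# Ability scores from a stable anchor set (invented, standing in
# for an estimated theta per model).
ability = {"m1": -1.5, "m2": -0.5, "m3": 0.0, "m4": 0.8, "m5": 1.6}

def item_stats(item):
    models = sorted(ability)
    passes = [results[item][m] for m in models]
    thetas = [ability[m] for m in models]
    pass_rate = mean(passes)          # 1.0 separates nothing
    sx, sy = pstdev(passes), pstdev(thetas)
    if sx == 0:
        return pass_rate, 0.0         # zero variance: saturated item
    cov = mean(p * t for p, t in zip(passes, thetas)) - pass_rate * mean(thetas)
    return pass_rate, cov / (sx * sy)  # point-biserial discrimination

for item in results:
    rate, disc = item_stats(item)
    verdict = "retire" if rate == 1.0 or disc < 0.2 else "keep"
    print(f"{item}: pass_rate={rate:.2f}, discrimination={disc:+.2f} -> {verdict}")
```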

Receipts

I have shipped products that respect both the I-O lineage and the ML one.

Connections at Amazon. Construct definition before instruction-writing, at population scale. Agreement statistics computed at the item level. The system was a measurement instrument first, a feedback loop second.

Imports at Procore. The scale-is-the-product principle, applied to construction data. If the field schema does not match how a foreman actually thinks about a punch list, you do not have a data quality problem. You have a construct problem.

Instill culture analytics. Agreement decomposition as a first-class feature. Hard items surfaced separately from sloppy raters. The product told you which was which.

Inyeon AI's compassion layer. The OFNR extraction (Observation, Feeling, Need, Request) is a construct schema, not a vibe. We declared the four constructs before we wrote a single prompt. Every downstream behavior is downstream of that declaration.
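The schema itself fits in a few lines. The comments are my paraphrase of the standard Nonviolent Communication definitions, not Inyeon's internal spec.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OFNR:
    observation: str  # what happened, camera-level, no evaluation mixed in
    feeling: str      # the speaker's emotion, not a thought about others
    need: str         # the universal need underneath the feeling
    request: str      # a concrete, doable ask, not a demand
```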

Silent Witness's nine-type evidence schema. Built with a collaborator for the Google Gemma 4 Good Hackathon. We reframed the project from "evidence vault" to "evidence completion engine." That reframe was a construct move. Once the nine types were declared, the trauma-informed UX wrote itself, because the user was no longer being asked to label their experience. They were being asked to fill in a frame they could see and trust.

The principle behind all of this

Declared beats inferred. I have written that line before, in other contexts. It applies here too.

Inferred constructs are what you get when you skip the definition step and let the rubric, the raters, and the loss function negotiate among themselves. The model learns the negotiation. The negotiation is not stable. Six months later you are debugging a behavior that was never specified, only discovered.

Declared constructs are what you get when a human writes down what the thing is, before anyone tries to measure it. The declaration is auditable. The disagreements become legible. The model has something to be aligned to that is not a moving target.

I-O Psychology is the field that figured out how to declare. ML is the field that is, finally, ready to listen.

The bridge is not theoretical. It is methodological. And it is shippable today.


Working notes: This post sits in a series on declared frames, alongside pieces on retrieval-first AI, declared relationships, and pattern weaving. The receipts above are the through-line: every product I have shipped is a declaration about what humans were doing, made before the system tried to measure them.