The single biggest predictor of whether your interview will tell you something useful is not who you hire to run it. It is whether you wrote the questions before the candidate walked in.
Unstructured interviews (the friendly chat, the gut-check, the rapport round) consistently underperform structured ones at predicting job performance. The research is decades deep, the effect sizes are large, and the practical implication for SMB hiring is the same answer almost every time: a hiring manager who runs the interview alone cannot also be the calibration layer. The structure has to be the calibration layer.
What follows is the framework we use at Join when we design interview loops for ourselves, condensed for SMB recruiters who don’t have a dedicated talent partner.
What 25 years of research actually says
The reference text for selection-method validity is Schmidt and Hunter’s 1998 meta-analysis in Psychological Bulletin, covering 85 years of research findings across 19 selection procedures. The headline number that gets quoted most: structured interviews predict job performance at a validity coefficient (the correlation between the selection score and later job performance) of around .51, while unstructured interviews land near .38. When a structured interview is combined with a general mental ability measure, the combined validity rises to .63.
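One way to see why combining two measures lifts validity is the textbook formula for the multiple correlation of two predictors. The sketch below is illustrative only: the function name `combined_validity` and the 0.30 correlation between interview scores and GMA are assumptions made for the example, not figures taken from the meta-analysis.

```python
from math import sqrt


def combined_validity(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation R of two predictors with one criterion.

    r1, r2: validity of each predictor on its own.
    r12:    correlation between the two predictors (must not be 1 or -1).
    """
    r_squared = (r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2)
    return sqrt(r_squared)


# Validities as quoted in the text; the 0.30 overlap is an assumed value.
structured, gma, assumed_overlap = 0.51, 0.51, 0.30
print(round(combined_validity(structured, gma, assumed_overlap), 2))  # 0.63
```

In this example, the less the two measures overlap, the more the second one adds on top of the first.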
An earlier meta-analysis by McDaniel and colleagues (1994) in the Journal of Applied Psychology, drawing on data from 86,311 individuals, found similar patterns: situational interviews outperform job-related interviews, which outperform psychologically framed interviews, and structured formats beat unstructured ones across the board. More recent work from the SIOP community has refined the 1998 numbers without overturning the direction: structured interviews remain among the highest-validity selection methods available to a hiring team, especially when combined with work-sample tests.
For a quick mental table:
| Selection method | Validity (≈) |
|---|---|
| Work-sample tests | .54 |
| Structured interviews | .51 |
| Cognitive ability (GMA) | .51 |
| Unstructured interviews | .38 |
| Reference checks (unstructured) | .26 |
| Years of education | .10 |
The interesting line is the .38-to-.51 gap. That gap is not produced by hiring smarter interviewers. It is produced by writing better questions in advance.
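To put the gap in more tangible terms, one conventional reading squares each validity coefficient to get the share of performance variance it accounts for. The snippet below is a rough sketch of that arithmetic using the figures from the table; the squared-correlation reading is a common convention rather than a claim made by the studies themselves, and the helper name `variance_explained` is hypothetical.

```python
# Validity coefficients as listed in the table above.
validity = {
    "Work-sample tests": 0.54,
    "Structured interviews": 0.51,
    "Cognitive ability (GMA)": 0.51,
    "Unstructured interviews": 0.38,
    "Reference checks (unstructured)": 0.26,
    "Years of education": 0.10,
}


def variance_explained(r: float) -> float:
    """Share of variance in job performance associated with a validity r."""
    return r * r


for method, r in validity.items():
    print(f"{method:<32} r = {r:.2f}   r^2 = {variance_explained(r):.3f}")

# How much more variance the structured format accounts for.
ratio = variance_explained(validity["Structured interviews"]) / variance_explained(
    validity["Unstructured interviews"]
)
print(f"structured vs unstructured: {ratio:.1f}x")  # 1.8x
```

On this reading, a .13 difference in the coefficient corresponds to the structured format accounting for roughly 1.8 times as much of the variation in performance.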
Question types that predict
Three categories of question carry the load.
- Behavioural questions. “Tell me about a time when you had to ship a project on a deadline you knew was unrealistic. What did you change, who did you talk to, and what happened?” These ask the candidate to walk through a specific past instance. The signal is in the level of detail they can recall and how clean their account of cause and effect is. Vague answers correlate strongly with vague past performance.
- Situational (hypothetical) questions. “You join the team next month. The product manager has shipped a feature that’s leaking customers; engineering says the fix takes six weeks; sales wants it in two. What do you do this week?” These ask the candidate to reason through a job-relevant scenario in real time. McDaniel and colleagues found that situational interviews edged out other job-related formats on predictive validity. They also feel like the actual job, which improves candidate experience.
- Work-sample tests. A 30-to-60-minute task that resembles the real work. A take-home for an engineer; a 30-minute live edit for a writer; a sales-call roleplay. Schmidt and Hunter put work samples at .54 validity, higher than any interview format. The trade-off is candidate time and the legal landscape around unpaid work, especially in DACH where unpaid work samples beyond a token amount are restricted.
Google’s published research on its own hiring practice (re:Work) found that structured interviewing predicted on-the-job performance across functions and seniority levels, and reduced adverse impact on protected groups. That second part matters: structure is not only more predictive, it is more defensible.
Question types that don’t predict anything
Three categories show up in almost every interview loop and produce noise.
- Brain teasers. “How many golf balls fit in a 747?” Google’s own research found these had no correlation with job performance, and the company quietly retired them. They survive in interview loops as a way for the interviewer to feel clever. Cut them.
- Rapport / “tell me about yourself” questions. Useful for the first thirty seconds of a conversation, useless as a hiring signal. They reward fluency, polish, and confidence (correlated with class background more than competence) and they prime interviewers to like or dislike a candidate before any signal has been collected.
- Self-assessment questions. “What’s your biggest weakness?” / “How would your last manager describe you?” The answers are uniformly coached, the calibration is impossible, and the variance you see is variance in self-presentation skill, not in fit for the role.
Cutting these does not save interview time. It saves interview attention, which is the bottleneck.
Building a scorecard the hiring manager will actually use
A structured interview without a scorecard is half the structure. The scoring is what forces the interviewer to compare to a standard rather than to the previous candidate.
The minimum-viable scorecard:
- One row per question.
- Each row has a 1-to-5 anchor scale with at least two anchors written out (e.g., “3 = walked through one project, recalled outcomes but not specific decisions; 5 = walked through one project, recalled specific decisions, cause and effect, and what they would change about their approach in hindsight”).
- Interviewer scores each row before any debrief with other interviewers, to avoid post-hoc rationalisation and groupthink.
- Final hire/no-hire decision is a function of the scores, not a separate gut call.
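The four rules above can be sketched as a small data structure. This is a minimal illustration, not Join’s implementation: the class and function names, the 3.5 average threshold, and the floor of 2 on any single row are all hypothetical choices made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class ScorecardRow:
    """One row per question, with written anchors on a 1-to-5 scale."""

    question: str
    anchors: dict[int, str]       # at least two anchors written out in advance
    score: Optional[int] = None   # filled in by the interviewer before debrief

    def __post_init__(self) -> None:
        if len(self.anchors) < 2:
            raise ValueError("write out at least two anchors per question")
        if any(level not in range(1, 6) for level in self.anchors):
            raise ValueError("anchors must sit on the 1-to-5 scale")

    def rate(self, score: int) -> None:
        if score not in range(1, 6):
            raise ValueError("score must be between 1 and 5")
        self.score = score


def decide(rows: list[ScorecardRow], hire_threshold: float = 3.5, floor: int = 2) -> str:
    """Derive hire/no-hire from the scores alone (thresholds are hypothetical)."""
    if any(row.score is None for row in rows):
        raise ValueError("score every row before the debrief")
    scores = [row.score for row in rows]
    if min(scores) < floor:
        return "no-hire"
    return "hire" if mean(scores) >= hire_threshold else "no-hire"


rows = [
    ScorecardRow(
        "Tell me about a time you shipped on an unrealistic deadline.",
        {3: "recalled outcomes but not specific decisions",
         5: "recalled specific decisions, cause and effect, and hindsight"},
    ),
    ScorecardRow(
        "A shipped feature is leaking customers. What do you do this week?",
        {2: "restates the problem without a plan",
         4: "sequences concrete steps and names the trade-offs"},
    ),
]
rows[0].rate(4)
rows[1].rate(4)
print(decide(rows))  # hire
```

Refusing to decide until every row has a score is the point of the design: it keeps the gut call from slipping in ahead of the evidence.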
This is also the section that gets reused the most. The 60-day outcomes that go into the job description (see our guide to writing job descriptions) are the same outcomes you should be scoring against in interviews. If the scorecard does not map to the job ad, one of the two documents is wrong.
For SMB hiring managers running the interview themselves, this discipline is the entire game. There is no panel of senior interviewers calibrating each other in the room with you. The scorecard, written before the candidate arrived, is what stands in for that panel.
Where AI helps, and where it stops
Drafting interview questions from the job description is a 30-second task for the model and a productivity assist for the human. Summarising the candidate’s answers into a structured record for later debrief is also fine. Both fall on the limited-risk side of the EU regulatory line.
Scoring candidates with AI is on the other side of that line. The EU AI Act classifies AI systems used for the recruitment or selection of natural persons as high-risk under Annex III, with full obligations (risk assessment, bias testing, human oversight, transparency) enforceable from 2 August 2026. An AI that ranks candidates or recommends hire/no-hire on the basis of interview answers is in scope. An AI that helps the human interviewer draft and remember is not.
The dividing line is the same one we apply across the product at Join: AI is the assistant, not the decision-maker. In an interview loop the scorecard is the assistant the hiring manager actually needs. The AI is a tool below that.
What this looks like in practice
For most SMB hires Join’s customers run, the configuration that produces the cleanest hiring decisions is three to five structured questions per interview, two interviewers (one with veto), one work-sample step, and a scorecard filled in before debrief. Adding interviewers past four rarely improves predictive validity in our customer pipelines; adding interview rounds past four rarely changes the decision. The thing that does change the decision is which questions got asked, and that is upstream of how many people are in the room.
The lesson from twenty-five years of meta-analyses, distilled: write the questions first. Calibrate before debrief. Treat AI as a drafting assistant, not a judge. The rest is editing.