Moving Beyond Black Box Fears: New Questions for AI in Hiring

Editor's Note: This post is also being shared on Tenzo.ai/blog

As conference season ramps up, many talent leaders will be evaluating AI tools for hiring. If you’re preparing to sit down with vendors, you’ll probably hear familiar questions surface: What data did you train on? Won’t this amplify bias? Isn’t this just another black box?

Those questions made sense a few years ago. They were shaped by real scandals and public missteps. But they don’t always fit the reality of how large language models (LLMs) are being used in hiring today.

Why the Concerns Exist

The skepticism around AI in hiring didn’t appear out of thin air. A series of very public failures gave people reason to be cautious:

  • Amazon’s résumé screener (2014–2017). It was trained on historical hiring patterns and ended up downgrading résumés from women applying for technical jobs. Amazon eventually shut the project down, but the story spread quickly.

  • HireVue facial analysis (retired in 2021). At one point candidates were being scored on their facial movements during interviews. After pushback from advocacy groups and attention from regulators, the company dropped this feature.

  • Google and Microsoft AI misfires. Google Photos once mislabeled Black people in offensive ways. Microsoft’s chatbot Tay was manipulated into racist speech in less than a day. These weren’t hiring tools, but they made headlines and shaped public perception of AI.

  • Facebook job ad targeting. Employers used Facebook’s platform to keep certain groups, like older workers and women, from even seeing job ads. Lawsuits and EEOC action followed, reinforcing the idea that AI often excludes people unfairly.

Those examples created a lasting impression. Many buyers came away thinking of AI as unsafe, biased, and impossible to understand.

From Black Box ML to LLMs

It’s worth remembering that these failures were built on machine learning approaches that were common at the time. Those models were trained on historical data, they operated like black boxes, and they often just reinforced the patterns of the past.

Large language models have brought us to a different place. Instead of ranking résumés or copying old decisions, they can understand natural language, probe reasoning, and evaluate responses to specific questions. Older machine learning tried to predict “fit.” LLMs are better at analyzing individual answers to structured questions and then showing the evidence behind each score.

What LLMs Are Trained On

Models like GPT-4 or Claude were trained on a mix of licensed data, publicly available internet data, books, articles, and code. The important point is that they were not trained to predict hiring outcomes from résumés, the way earlier screening models were.

The bigger issue that should concern buyers in 2025 is not the base training itself, but how each vendor uses the LLM in practice. The real questions should be: How is the model fine-tuned for this use case? How are prompts designed to evaluate skills instead of people? How are the outputs validated for fairness and consistency?
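To make the prompt-design question concrete, here is a minimal sketch of what skill-focused prompting can look like. Everything in it is illustrative: the function name, rubric text, and prompt wording are hypothetical, not any particular vendor's implementation.

```python
# Illustrative sketch only: the rubric, names, and prompt wording are
# hypothetical, not a specific vendor's implementation.

def build_scoring_prompt(skill: str, question: str, answer: str, rubric: str) -> str:
    """Build a prompt asking an LLM to score one answer against one rubric.

    Note what is deliberately absent: no name, résumé, school, or job
    history. The model sees only the question, the answer, and the
    job-relevant criteria.
    """
    return f"""You are scoring a single interview response.

Skill being assessed: {skill}
Question asked: {question}
Candidate response: {answer}

Scoring rubric:
{rubric}

Return a score from 1-5 and a short justification that quotes the
specific parts of the response that support the score. Evaluate only
this response against the rubric. Do not speculate about the candidate.
"""

prompt = build_scoring_prompt(
    skill="SQL debugging",
    question="A nightly report query suddenly runs 10x slower. How would you investigate?",
    answer="First I'd check whether the query plan changed...",
    rubric="5 = systematic diagnosis (plan, indexes, data volume, recent changes); "
           "3 = plausible but incomplete; 1 = guesswork with no method.",
)
print(prompt)
```

The design choice worth noticing is what the prompt excludes: because the model is only ever handed a question, an answer, and a rubric, there is nothing about the person for it to judge.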

A Better Foundation: Assess Skills, Not People or Résumés

The safest and most compliant approach to hiring starts with a simple principle: assess individual skills directly. Do not infer them from résumés, job titles, or a sense of pedigree.

That means:

  1. Define the skills that really matter for the role. Break them down into observable knowledge, skills, and abilities.

  2. Map structured questions to each skill. For example, questions that probe problem solving, technical reasoning, compliance knowledge, or customer empathy.

  3. Score every response on its own. Each answer is evaluated against job-relevant criteria, not lumped into a single score.

  4. Compile scores into a skills profile. Results show strengths, emerging skills, and areas for growth (a brief code sketch of steps 3 and 4 follows this list).

  5. Keep human judgment in its proper place. The model scores answers; recruiters and hiring managers evaluate the person, using that evidence.
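As a rough illustration of steps 3 and 4, the sketch below takes independently scored responses and rolls them up into a per-skill profile. The score bands (strong / emerging / growth area) and their thresholds are assumptions chosen for illustration, not a published standard.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-question scores, each produced independently by an
# LLM evaluation like the one sketched earlier: (skill, question_id, score 1-5).
question_scores = [
    ("problem_solving", "q1", 5),
    ("problem_solving", "q2", 4),
    ("compliance_knowledge", "q3", 3),
    ("customer_empathy", "q4", 2),
]

def build_skills_profile(scores, strong=4.0, emerging=3.0):
    """Aggregate independent question scores into a per-skill profile.

    The band thresholds are illustrative assumptions; a real rubric
    would be validated for the specific role.
    """
    by_skill = defaultdict(list)
    for skill, _question_id, score in scores:
        by_skill[skill].append(score)

    profile = {}
    for skill, skill_scores in by_skill.items():
        avg = mean(skill_scores)
        if avg >= strong:
            band = "strong"
        elif avg >= emerging:
            band = "emerging"
        else:
            band = "growth area"
        # Keep the per-question scores alongside the band so every
        # rollup stays auditable: a reviewer can trace any label back
        # to specific answers.
        profile[skill] = {"average": round(avg, 2), "band": band, "evidence": skill_scores}
    return profile

print(build_skills_profile(question_scores))
```

Keeping the raw per-question scores in the profile is what makes step 5 workable: the human reviewer can always trace a band back to the specific answers behind it.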

Why AI Interviews Add Value

Once you have this structure in place, AI interviewing makes it possible to do at scale what humans cannot. Large language models don’t have to judge people or résumés. They can analyze responses, one question at a time, and provide depth that a résumé will never give you.

A résumé says, “This person held a certain title, so they must know X.” An AI interview shows, “Here is how this candidate actually responded to a question about X.”

One is an assumption. The other is evidence.

Résumé Ranking vs. Skills Evaluation

  • Then (Résumé Ranking): Counted keywords and formatting tricks. Now (Skills Evaluation): Structured, skill-mapped questions evaluated one by one.
  • Then: Repeated the same bias found in historical data. Now: Every qualified candidate has the same opportunity to respond.
  • Then: Produced a single opaque score. Now: Independent scores per skill, question by question.
  • Then: Rewarded schools, job titles, and formatting. Now: A transparent skills map: strong in A, emerging in B, growth area in C.
  • Then: Left recruiters skimming résumés unevenly. Now: Recruiters review consistent, evidence-based responses.
  • Then: Nearly impossible to audit or explain. Now: Every score ties back to a specific candidate answer.

The Better Questions for 2025

Instead of asking, “What data did you train on?” ask, “How are individual question responses mapped to skills, and how is expertise scored?”

Instead of asking, “Won’t this amplify bias?” ask, “How do you make sure each skill is scored independently and transparently, with evidence I can review?”

Instead of asking, “What were LLMs trained on?” ask, “How is the model adapted to evaluate skills in a fair, explainable way for this role?”

Bottom Line

The scandals of the last decade explain the skepticism that still lingers. But ranking résumés, whether by humans or by machines, is not the right path forward.

The stronger path is structured skill assessment. Large language models make that possible at scale by evaluating responses to job-relevant questions, not résumés or people. Each answer can be scored independently, the results are transparent, and recruiters have real evidence to work with.

This isn’t about guessing who is qualified. It’s about giving every qualified candidate the chance to show what they know, fairly and consistently.
