A Conversation About AI That Actually Matters
AWS Heroes meet Now Go Build CTO Fellows at re:Invent
Around the table sat people managing educational platforms for 800,000 students, disaster-response systems in active war zones, healthcare assessments for vulnerable populations, and food-recovery operations across 20 regions. AWS Heroes—technical experts who’ve built systems at massive scale—sat alongside Now Go Build CTO Fellows—technology leaders working at the absolute frontlines of human need.
The question on the table: How do you actually deploy AI when getting it wrong means people get hurt?
Not “how do you optimize for engagement” or “how do you reduce churn.” How do you build systems that serve people who can’t advocate for themselves, in places where infrastructure barely works, with teams of six people trying to reach millions?
The AI Tutor That Has to Work Without the Internet
One of the first challenges came from someone running an educational platform serving 800,000 learners across every country. Their team: 25-30 people. Their current approach: AI helps create and localize content, but PhD-level experts review everything before it reaches students.
It’s careful. It’s responsible. And it doesn’t scale to where they need to go.
“In the countries where we work, students miss entire days of school regularly,” they explained. Military conflicts. Infrastructure failures. In some places, every Monday. The dream: AI tutors that run entirely on phones, work without internet, and can teach kids even when nothing else is working.
But here’s the catch—how do you ensure safety when the AI can’t check back with a server for guardrails? When there’s no human in the loop because there’s literally no loop to be in?
The room started discussing. Keep the AI tightly bound—it can only talk about specific curriculum topics. If a student asks something outside that scope, it just says, “I don’t know how to answer that.” Don’t try to be helpful beyond your domain.
Someone raised a more subtle problem: “Vector search will find semantically similar content, but it will miss gaps of information that are relevant but don’t match.” In other words, the AI might find related information, but miss the crucial context that makes it meaningful. The solution involves explicitly mapping how concepts relate to each other, not just how similar they are.
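One way to picture "mapping how concepts relate" is a prerequisite graph walked at retrieval time. The graph fragment below is hypothetical, but it shows what similarity search misses: "multiples" is retrieved for "adding fractions" even though the two phrases are nothing alike.

```python
# Hypothetical fragment of a curriculum concept graph:
# concept -> the concepts a learner needs first.
PREREQUISITES = {
    "equivalent fractions": ["fractions"],
    "adding fractions": ["equivalent fractions", "common denominators"],
    "common denominators": ["multiples", "fractions"],
}

def context_for(concept: str) -> list[str]:
    """Walk prerequisite edges so retrieval includes the concepts a
    pure similarity search would miss, not just near-neighbors."""
    seen, stack = [], [concept]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.append(node)
        stack.extend(PREREQUISITES.get(node, []))
    return seen
```

A vector index can still rank passages within each retrieved concept; the graph just guarantees the prerequisite context is in the candidate set at all.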
For the highest-stakes applications, there’s even a way to get mathematical proof of safety—similar to how Boeing and Airbus verify their flight systems. The AI generates the rules, humans verify that they are correct, and the system is guaranteed to stay within those boundaries.
But then someone pushed back on the whole premise: “The research shows learning requires friction. We shouldn’t eliminate struggle; we should give learners tools to manage it productively.” This hit something important. The goal isn’t to make everything easy. Real learning involves being challenged, even frustrated. The question is how to support students through that difficulty, not eliminate it.
Later, someone set an ambitious bar: “Until the learning is as addictive as a Netflix series where you watch eight hours of it, we have not got our job done.” But they also acknowledged a tension: “My kids use some of these AI-enhanced learning platforms now, and they hate them because they know the AI is constantly assessing them.”
There’s the paradox. We want engagement. But surveillance isn’t engaging—it’s oppressive. The technology that measures everything risks destroying the intrinsic motivation that makes learning work.
The Healthcare System That Had to Be Perfect
Then came a different kind of challenge. A healthcare organization analyzes surveys and personal stories to identify psychosocial factors—abuse, housing instability, depression—that healthcare workers need to know about. The AI’s job: read through 57 questions plus long-form written responses and flag what matters.
But these flags go into clinical decisions. They have to be right. “How do I ensure quality control with a small team of subject matter experts when we’re about to scale?”
Turns out, most of the survey data was structured—multiple choice, scales, yes/no answers—and that part wasn’t really an AI problem at all. For structured data, you can use statistical methods that give you the same answer every single time. No randomness. No drift. Just consistency. The AI only needed to handle the free-text responses, where people write about their lives in their own words. That’s where language understanding matters.
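Deterministic rules for the structured portion might look like the sketch below. The field names and thresholds are invented for illustration (PHQ-9 scoring a depression screen at 10 is a common clinical convention, but this organization's actual rules weren't shared); what matters is that the same input always produces the same flags.

```python
# Sketch: fixed rules for structured survey answers. Field names and
# thresholds are hypothetical; the point is determinism, not the rules.
def flag_structured(responses: dict) -> list[str]:
    """Same input in, same flags out—no randomness, no drift."""
    flags = []
    if responses.get("phq9_score", 0) >= 10:       # depression screen
        flags.append("depression-screen")
    if responses.get("housing_stable") is False:
        flags.append("housing-instability")
    if responses.get("feels_safe_at_home") is False:
        flags.append("safety-concern")
    return flags
```

Because the rules are explicit, they're also auditable: a clinician can read them, argue with them, and version them—none of which is true of a generative model's output.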
But even with the right technical approach, the validation problem remained. How do you check thousands of assessments with a handful of experts? Use project management tools teams already know. When the AI makes a prediction, it automatically creates a review task. The expert approves or corrects it. The corrections feed back into the model. No custom software needed.
Another safeguard: run the same data through multiple AI systems. If they all agree, confidence goes up. If they disagree, it gets flagged for mandatory human review. “That’s what Boeing and Airbus do,” someone noted. “Multiple systems, voting mechanisms.” You don’t need to validate everything. Check 100% when the AI says it’s uncertain. Sample the rest. Track accuracy over time. For validations that don’t require deep medical expertise, use crowd validation where multiple people review the same output, and consensus determines acceptance.
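The voting mechanism itself is simple. Here's a minimal sketch, assuming each model has already produced a label for the same free-text response; any disagreement at all routes the case to a human, which is the conservative policy a clinical setting would want.

```python
from collections import Counter

def route(predictions: list[str], threshold: float = 1.0) -> tuple[str, bool]:
    """Return (majority label, needs_human_review).

    predictions: one label per model for the same input.
    threshold: required agreement ratio; 1.0 means unanimity,
    so any disagreement flags the case for mandatory review.
    """
    label, votes = Counter(predictions).most_common(1)[0]
    agreement = votes / len(predictions)
    return label, agreement < threshold
```

Lowering the threshold (say, to 0.66) trades review workload for risk—exactly the kind of dial a small expert team would tune as accuracy data accumulates.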
Then came a fascinating insight about human psychology. How you frame AI predictions changes how people respond to them.
“There’s a 30% chance this is wrong” makes people more careful than “70% confidence it’s right”—even though these mean the exact same thing. For life-and-death decisions, frame uncertainty as risk. It changes behavior.
Trust in the Age of Lies
The next challenge came from an organization working across conflict zones—Ukraine, Turkey, Mexico—places where humanitarian workers need accurate information to stay alive. Every day, they compile situation reports from multiple sources: social media, news, and ground reports from local contacts. Here’s the nightmare scenario: a convincing video of an explosion makes it into a situation report. Workers avoid a safe route or miss a critical intervention window. Except the explosion never happened. It was AI-generated.
“False information doesn’t just waste time,” they said. “It puts lives at risk.”
And here’s what makes it harder: “The humanitarian workforce is significantly aging. With funding cuts, organizations aren’t hiring new people. The digital literacy problem is getting worse, not better.”
AI can scrape and aggregate information orders of magnitude faster than humans. But speed without accuracy is worse than useless—it’s dangerous.
A possible solution is to enforce data lineage. Track everything. Where did this information come from? How was it processed? Who else is reporting it? This isn’t just for validation—when something seems suspicious, you need to be able to trace back through every step. Look for technical indicators. Images and videos contain metadata—timestamps, GPS data, and visual artifacts from editing. These can be analyzed to flag content that appears manipulated.
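A lineage record doesn't need to be elaborate to be useful. The sketch below—a hypothetical structure, not this organization's actual system—attaches a provenance chain and a corroboration list to every claim in a report, so a suspicious item can be traced back through each step that touched it.

```python
# Sketch: per-claim lineage for a situation report. Field names are
# illustrative; the point is that every processing step is recorded.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageStep:
    actor: str      # e.g. "scraper", "translator", "duty analyst"
    action: str     # e.g. "collected", "translated", "verified"
    at: str         # UTC timestamp of the step

@dataclass
class ReportItem:
    claim: str
    source: str                                   # where it first came from
    corroborations: list[str] = field(default_factory=list)
    lineage: list[LineageStep] = field(default_factory=list)

    def record(self, actor: str, action: str) -> None:
        self.lineage.append(LineageStep(
            actor, action, datetime.now(timezone.utc).isoformat()))
```

When a claim turns out to be fabricated, the chain answers the forensic questions—who collected it, who vouched for it, and which other items share the tainted source.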
Someone proposed an elegant solution to the training problem: build it directly into the workflow. Instead of separate training sessions that people forget, the system challenges reviewers in the moment: “This image has characteristics consistent with AI generation. What indicators do you see?”
The system doesn’t let them move forward until they articulate their reasoning. Right or wrong, they’re thinking critically about that specific case.
Other Realities
Mapping with biased AI. Someone working on geospatial mapping explained that global AI models trained primarily on US and European data perform terribly in developing regions. Deploy them to map villages in Rwanda, and they miss buildings or misclassify structures entirely. Their solution turned the problem into an opportunity: local community members provide feedback that fine-tunes the models for their specific region. Simple mobile interfaces—swipe yes or no, is this a building?—create a continuous improvement loop. “This is about community inclusion,” they said. “Locals train the systems that map their environment. It addresses the bias in training data and makes the whole process transparent.”
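The swipe loop can be sketched as an accumulator: each yes/no becomes a labeled example, and a region queues for fine-tuning once enough corrections pile up. Everything here—the threshold, the field names, the trigger logic—is hypothetical, a shape rather than their implementation.

```python
# Sketch: community swipe feedback feeding regional fine-tuning.
# The threshold is an arbitrary illustrative value.
FINE_TUNE_THRESHOLD = 500

def record_swipe(store: dict, region: str, tile_id: str,
                 predicted: str, confirmed: bool) -> bool:
    """Store one yes/no review; return True when the region has
    accumulated enough labels to trigger a fine-tuning run."""
    examples = store.setdefault(region, [])
    examples.append({
        "tile": tile_id,
        "label": predicted if confirmed else "rejected:" + predicted,
    })
    return len(examples) >= FINE_TUNE_THRESHOLD
```

Rejections are kept as labeled negatives rather than discarded—for a model that misclassifies local structures, the "no" swipes are the most valuable data.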
Health information in hostile territories. A reproductive health organization works in regions where providing its services is literally illegal. They need to provide accurate, potentially life-saving information, but they can’t collect the personal data needed for AI personalization. The constraints are severe. Users may be in immediate danger. The information must be medically precise. The system must detect emotional distress and escalate to human counselors when needed. Their approach emphasizes curation over generation: the AI draws only from pre-validated medical resources, every response includes clear citations, and there’s always a prominent path to reach a human counselor.
When voice changes everything. Multiple people raised the challenge of serving populations with limited literacy, particularly in languages without strong written traditions. The shift from text to voice interfaces made a dramatic difference: “Once we moved from text chatbot to voice-based, we saw significantly better engagement with low-literacy populations. The emotion detection was critical—if someone sounds sad, you need to adapt the interaction.” Voice with emotion detection allows systems to respond not just to what people say, but how they say it. It makes the interaction feel human in a way text never can.
As the conversation wound on, certain truths kept surfacing, sometimes explicitly, sometimes as subtext to other discussions.
Not every problem needs AI. For structured data with clear rules, traditional statistical methods often work better. They’re consistent, auditable, and don’t have the unpredictability of generative systems. The maturity isn’t in knowing the latest AI techniques—it’s in knowing when not to use them. Sometimes a problem is “just” a classic machine learning problem, not a generative AI problem.
Small teams can provide oversight. You can’t manually review every AI output, but you also can’t deploy systems without oversight. The answer isn’t choosing one or the other—it’s being strategic about where human judgment adds value. Validate everything when confidence is low. Sample when confidence is high. Use tools teams already know rather than building custom systems. Have corrections automatically improve the model.
Humans need context, not just answers. When AI makes a prediction, showing why it reached that conclusion helps humans effectively validate it. Citations, confidence scores, explicit reasoning—these aren’t nice-to-haves. They’re what make the partnership between human and machine actually work.
How you frame things matters more than you’d think. That insight about “30% chance this is wrong” versus “70% confidence it’s right”—mathematically identical, psychologically different—kept coming up. For high-stakes decisions, frame uncertainty as risk. It changes how people think.
Trust requires transparency. Track where information came from and how it was processed. When something goes wrong, trace back to find where the error was introduced. This isn’t bureaucracy—it’s the foundation of accountability.
Training should be embedded, not scheduled. Rather than separate training sessions that people forget, build challenges directly into the workflow. Make people articulate their reasoning before accepting or rejecting AI outputs. Provide guidance in the moment when they struggle. Learning that happens in context sticks.
Some things should stay human. This was the insight that kept resonating. Some interactions should remain human even if they could technically be automated.
“Automation decisions aren’t purely technical—they encode values about human dignity and relationship importance.”
Technology should serve human flourishing, not just efficiency. Sometimes the relationship itself is the value being delivered.
What does this mean?
In typical tech discussions, AI is about optimization, efficiency, and competitive advantage. Around this table, the question was whether a student in an active war zone could learn math. Whether a humanitarian worker would walk into danger based on fabricated information. Whether someone in crisis would get help or get dismissed.
Start with the constraint, not the capability. The best solutions came from embracing limitations—offline-first design, minimal data collection, small teams—rather than treating them as compromises. Design for the world as it is, not as you wish it were.
Choose boring technology when it solves the problem. The sophistication to use simpler statistical methods instead of AI, or traditional databases instead of vector search, often leads to better outcomes. Complexity should be justified by requirements, not assumed by default.
Make humans and AI partners, not adversaries. The goal isn’t to validate whether the AI was “right.” It’s about combining human judgment with AI capabilities to make better decisions than either could alone. Design systems where humans add value, not just check boxes.
Build feedback loops into everything. Every human correction should automatically improve the system. If your validation process doesn’t feed back into training, you’re missing the opportunity to get better over time.
“I’m interested in finding the balance between using AI to enable people versus using AI to replace them. The higher order tends to view it as replacement rather than enablement.”
That’s the real tension, isn’t it? Not the technical challenges—those are solvable. The harder question is what we’re solving them for. AI should amplify human capability, not substitute for human judgment. It should help small teams serve millions without losing the human connection that makes the service meaningful.
This isn’t idealism. It’s practical. Systems that lose the human element, that treat efficiency as the only goal, that optimize for metrics at the expense of relationships—they ultimately fail to serve the populations they’re designed to help.
Human-in-the-loop is the design principle that makes AI trustworthy for populations that can’t afford our mistakes. The technical problems—validation at scale, offline operation, handling uncertainty—those have solutions. The harder work is ensuring we’re solving them in service of human flourishing, not just scale.
Because somewhere, there’s a student who needs to learn despite conflict disrupting school. A healthcare worker making decisions that affect people’s lives. A humanitarian worker whose safety depends on accurate information. A person in crisis who needs help, not judgment. They can’t advocate for themselves in these technology decisions. So we have to advocate for them. We have to build systems worthy of their trust.
That’s what sustainable and ethical AI stands for.


