AI moderation without the false-positive tax.

2026-04-05 · 6 min read · The Veluxa team

There is a dirty secret inside every "AI moderation" product pitch: the false-positive rate.

Every vendor will tell you their classifier catches 98% of bad content. Almost none of them will tell you the other half of the story: how often they flag something that wasn't actually bad.

That's the tax. And it's the thing that kills community moderation in production.

The problem with optimizing for recall

AI moderation classifiers, almost universally, are trained to maximize recall — catch as much of the bad stuff as possible. Great in theory. In practice, you end up with a review queue that looks like this:

  • Someone quotes a harassment complaint to report it: flagged as harassment.
  • Someone says "this is sick" meaning "this is great": flagged as violence.
  • Someone posts a Shakespeare quote: flagged as hate speech.
  • Someone shares a news article about a real-world incident: flagged as violence.

You set a low threshold, you drown in false positives. You set a high threshold, you miss real abuse. And the moderator team burns out either way.

What we actually did

We spent three months iterating on this in closed beta. The answer we landed on isn't clever — it's just engineered carefully.

Layer 1: Two classifiers in parallel

Every incoming group message hits two classifiers simultaneously:

  1. OpenAI omni-moderation-latest. Tuned for harassment, hate, self-harm, sexual, violence, illicit.
  2. Perspective API. Tuned for toxicity, insult, identity-attack, threat, profanity.

Neither is perfect alone. Together, they disagree often enough that the disagreement itself is signal. If both flag at high confidence, it's almost certainly bad. If only one flags, it's probably the gray zone. If neither flags, it's fine.

This ensembling alone drops the false-positive rate by about 40% versus either model solo.
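
Roughly, the scoring looks like this. It's a minimal sketch using the official openai Python SDK and a plain HTTPS call to the Perspective API; the helper names, the way we collapse each model's categories to a single number, and the 0.85 "high confidence" cutoff are illustrative, not our production code.

```python
# Sketch: score one message with both classifiers and combine the verdicts.
# Assumes OPENAI_API_KEY in the environment and a Perspective API key.
import requests
from openai import OpenAI

client = OpenAI()
PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def openai_score(text: str) -> float:
    # Highest category score from omni-moderation-latest.
    result = client.moderations.create(
        model="omni-moderation-latest", input=text
    ).results[0]
    scores = result.category_scores.model_dump().values()
    return max(s for s in scores if s is not None)

def perspective_score(text: str, api_key: str) -> float:
    # Highest attribute score from Perspective's toxicity family.
    body = {
        "comment": {"text": text},
        "requestedAttributes": {
            "TOXICITY": {}, "INSULT": {}, "IDENTITY_ATTACK": {},
            "THREAT": {}, "PROFANITY": {},
        },
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key},
                         json=body, timeout=5)
    resp.raise_for_status()
    attrs = resp.json()["attributeScores"]
    return max(a["summaryScore"]["value"] for a in attrs.values())

def ensemble_verdict(text: str, api_key: str, high: float = 0.85) -> str:
    a, b = openai_score(text), perspective_score(text, api_key)
    if a >= high and b >= high:
        return "almost_certainly_bad"   # both flag at high confidence
    if a >= high or b >= high:
        return "gray_zone"              # only one flags: disagreement is signal
    return "fine"
```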

Layer 2: Two thresholds, not one

Every rule has two thresholds: an auto-action threshold and a review threshold. They are deliberately different.

  • Above the auto-action threshold (typically 0.85 confidence): we just do the thing — delete, mute, ban. No review.
  • Between the review and auto-action thresholds (typically 0.40–0.85): the item lands in the human review queue.
  • Below the review threshold: we log it and ignore it.

The key is that auto-action is deliberately conservative. We'd rather let 5% of abuse through to review than remove 0.5% of legitimate content automatically. Because the second kind of mistake is the one users never forgive.
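
In code, the routing is almost boring. Here's a minimal sketch with the two thresholds hard-coded; in the product they're set per rule.

```python
# Sketch: route a scored message to one of three outcomes.
# The 0.85 / 0.40 values are the typical defaults mentioned above.
AUTO_ACTION_THRESHOLD = 0.85   # above this: act without review
REVIEW_THRESHOLD = 0.40        # above this: send to the human queue

def route(confidence: float) -> str:
    if confidence >= AUTO_ACTION_THRESHOLD:
        return "auto_action"    # delete / mute / ban immediately
    if confidence >= REVIEW_THRESHOLD:
        return "review_queue"   # a human decides
    return "log_only"           # record it and move on
```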

Layer 3: Context weighting

Incoming messages are weighted by user history before being scored:

  • New joiner (under 10 messages): confidence weighted +15%. First-day accounts are 4x more likely to be bots or bad actors in our data.
  • Established member (over 90 days): confidence weighted −10%. Long-term members get the benefit of the doubt.
  • Admin-tagged trusted: never auto-actioned, only ever lands in review.

This alone drops false positives another 20–30% in active communities.
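
A minimal sketch of the weighting, where the member record and the multiply-by-1.15 / 0.90 reading of the +15% / −10% adjustments are illustrative stand-ins for our real data model.

```python
# Sketch: adjust a classifier confidence score by user history
# before routing it, per the rules listed above.
from dataclasses import dataclass

@dataclass
class Member:
    message_count: int     # messages sent in this community
    days_since_join: int
    admin_trusted: bool    # tagged trusted by an admin

def weighted_confidence(raw: float, member: Member) -> float:
    if member.message_count < 10:        # new joiner: weight up 15%
        raw *= 1.15
    elif member.days_since_join > 90:    # established: weight down 10%
        raw *= 0.90
    return min(raw, 1.0)

def can_auto_action(member: Member) -> bool:
    # Admin-tagged trusted members only ever land in review.
    return not member.admin_trusted
```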

The math

With all three layers, across 4.1M incoming messages we processed in our April cohort:

| Metric | Value |
|---|---|
| True positives (real abuse caught) | 94.2% |
| False positives (legitimate flagged) | 1.7% |
| False negatives (abuse missed) | 4.1% |
| Human review queue volume | 2.3% of messages |

1.7% false positives is the number we're proud of. Industry benchmarks we've been able to find sit between 6 and 14%. Below 2% is where moderation stops being a tax on legitimate users and starts being a tool that actually works.

The human-in-the-loop piece

Every item in the review queue is decided by a human — approve, remove, or ban — with one click. Every decision writes to the audit log with timestamp, actor, confidence score, and reason.
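
A row looks roughly like this. The schema and CSV layout here are illustrative, not our exact columns.

```python
# Sketch: append one audit-log row with the fields named above.
import csv
import os
from datetime import datetime, timezone

AUDIT_FIELDS = ["timestamp", "actor", "decision", "confidence", "reason"]

def append_audit_row(path: str, actor: str, decision: str,
                     confidence: float, reason: str) -> None:
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=AUDIT_FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "decision": decision,        # approve / remove / ban
            "confidence": confidence,
            "reason": reason,
        })
```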

That audit log is exportable as CSV, which matters for two audiences:

  1. Regulated industries (fintech, healthcare, legal) that need to prove what happened.
  2. Community managers who need to show their founders that yes, the moderation team is actually working.

Both matter. We built for both.

What this doesn't solve

We're not claiming AI moderation is finished. Specifically, we're bad at:

  • Sarcasm and inside jokes. Classifiers don't get them. We flag them, you approve them.
  • Code-switched languages. A message that mixes Hindi and English in Devanagari + Latin scripts is hard. We're getting better, but we're not done.
  • Sophisticated scams. Careful phishers who write perfectly grammatical English slip through text classifiers. We run a separate scam-pattern classifier trained on our own data that catches most of these, but not all.

If any of those are your main problems, talk to us. We're probably working on your exact edge case already.