Is video anomaly detection misframed?
Video anomaly detection is often described as the problem of identifying unusual events in video.
In practice, however, many recent formulations implicitly redefine the task.
Instead of detecting deviations from normal behavior, models are trained to recognize a set of familiar abnormal actions.
Consider a standard setup. A model assigns a score $s(x) \in \mathbb{R}$, where higher values indicate higher likelihood of anomaly.
The evaluation objective is typically to ensure that $s(x_{\text{normal}}) < s(x_{\text{abnormal}})$. This formulation appears reasonable, but it assumes that “abnormal” is a well-defined category.
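To make the ranking objective concrete, here is a minimal sketch that counts how often a normal clip scores below an abnormal one. The function name and the scores are made up for illustration; the quantity it computes coincides with the area under the ROC curve, which is one common way this objective is reported.

```python
from itertools import product


def pairwise_ranking_accuracy(normal_scores, abnormal_scores):
    """Fraction of (normal, abnormal) pairs with s(normal) < s(abnormal).

    Ties count as half a correct pair. Hypothetical helper, not from the post.
    """
    if not normal_scores or not abnormal_scores:
        raise ValueError("both score lists must be non-empty")
    correct = 0.0
    for s_normal, s_abnormal in product(normal_scores, abnormal_scores):
        if s_normal < s_abnormal:
            correct += 1.0
        elif s_normal == s_abnormal:
            correct += 0.5
    return correct / (len(normal_scores) * len(abnormal_scores))


# Made-up anomaly scores, purely illustrative.
normal_scores = [0.1, 0.2, 0.4]
abnormal_scores = [0.3, 0.9]

auc = pairwise_ranking_accuracy(normal_scores, abnormal_scores)
print(f"pairwise ranking accuracy: {auc:.3f}")
```

Note that nothing in this metric asks why a clip received a high score. A model that simply recognizes a familiar abnormal category can satisfy it as well as one that measures deviation from normality.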
In many datasets, abnormal events are semantically distinct, visually salient, and consistent across scenes. As a result, models can rely on recognition cues rather than contextual reasoning.
Detecting “fighting” or “fire” becomes sufficient, even if the model does not understand whether the event is actually anomalous in that scene.
A more faithful formulation starts from a different assumption. Let $x \sim p_{\text{normal}}^{\text{scene}}(x)$ represent normal activity in a specific environment.
Then video anomaly detection becomes $s(x) = d(x, p_{\text{normal}}^{\text{scene}})$, where $d(\cdot, \cdot)$ measures deviation from scene-specific normality.
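One way to picture such a deviation score is sketched below. It is an assumption-laden toy, not a method from this post: normal activity in one scene is summarized by a per-feature mean and standard deviation, and the deviation d is the standardized Euclidean distance to that summary. The class name, the two features, and all numbers are hypothetical.

```python
import math
from statistics import mean, pstdev


class SceneNormalityModel:
    """Per-feature mean/std summary of normal clips from ONE scene.

    Illustrative assumption: features are treated as independent, so the
    deviation reduces to a standardized Euclidean distance.
    """

    def __init__(self, normal_features, eps=1e-6):
        if not normal_features:
            raise ValueError("need at least one normal feature vector")
        columns = list(zip(*normal_features))
        self.mu = [mean(col) for col in columns]
        # Floor the std so a constant feature does not cause division by zero.
        self.sigma = [max(pstdev(col), eps) for col in columns]

    def score(self, x):
        """Deviation of feature vector x from this scene's normal summary."""
        return math.sqrt(
            sum(((xi - m) / s) ** 2 for xi, m, s in zip(x, self.mu, self.sigma))
        )


# Hypothetical features per clip: [average speed, motion energy].
normal_clips = [[1.0, 0.2], [1.2, 0.3], [0.8, 0.1], [1.0, 0.2]]
model = SceneNormalityModel(normal_clips)

print(model.score([1.1, 0.2]))  # close to what this scene usually shows
print(model.score([3.0, 0.2]))  # far from what this scene usually shows
```

The model never sees an abnormal example. It is fit only on what is normal for this one scene, and a different scene would get its own model.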
This distinction becomes clear when the same action appears in different contexts.
Fighting inside a boxing ring may be normal, while the same action in a crowd is clearly anomalous.
What matters is not the action itself, but how it relates to the structure of the scene.
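The boxing-ring example can be sketched with a toy scorer that measures how surprising an action is given what a scene usually contains. Everything here is hypothetical: the action vocabulary, the observation counts, and the choice of smoothed negative log-frequency as the score. It also presumes some upstream component has already labeled the actions, which is a simplification for the sake of the illustration.

```python
import math
from collections import Counter


def surprisal(action, scene_history, vocabulary, alpha=1.0):
    """Negative log of the smoothed frequency of `action` in one scene.

    Add-alpha smoothing keeps never-observed actions at a finite score.
    """
    counts = Counter(scene_history)
    total = len(scene_history) + alpha * len(vocabulary)
    probability = (counts[action] + alpha) / total
    return -math.log(probability)


vocabulary = ["walking", "standing", "fighting"]

# Made-up histories of what is normally observed in each scene.
boxing_ring = ["fighting"] * 80 + ["standing"] * 20
plaza = ["walking"] * 70 + ["standing"] * 30

ring_score = surprisal("fighting", boxing_ring, vocabulary)
plaza_score = surprisal("fighting", plaza, vocabulary)

print(f"fighting in the ring:  {ring_score:.2f}")
print(f"fighting in the plaza: {plaza_score:.2f}")
```

The action label is identical in both calls. Only the scene history changes, and the score changes with it: fighting is far more surprising in the plaza than in the ring.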
Recent trends move away from this formulation. Weakly supervised and multi-scene approaches encourage models to learn general anomaly categories rather than scene-specific behavior.
Large pretrained models further amplify this effect by introducing semantic priors, causing systems to respond to familiar anomaly types instead of deviations from normal patterns.
This shift changes the nature of the task. Video anomaly detection becomes closer to action recognition with anomaly labels than to detecting deviations from learned normality.
The difference is subtle but important.
In one case, the model learns what looks abnormal in general. In the other, it learns what is abnormal in a specific scene.
This also explains why many recent methods struggle in single-scene settings.
When anomalies depend on location, context, or interaction patterns, recognizing a predefined category is not sufficient.
The model must understand what is normal first.
In this sense, the core challenge of video anomaly detection is not detecting rare events, but defining normality within a constrained environment.
And that is fundamentally different from recognizing actions.
This perspective is explored in more detail in “Is Video Anomaly Detection Misframed? Evidence from LLM-Based and Multi-Scene Models.”