Why does Sora struggle with object physics?

While Sora 2 represented a significant improvement in physics simulation, the model struggled with certain object interactions and physical phenomena:

Why Sora Failed at Some Physics:

Learned vs. Programmed Physics:

Sora didn't encode physics as explicit mathematical rules (like video game engines do). Instead, it learned physics statistically from training data. This approach works well for common scenarios frequently depicted in videos but fails for rare or complex interactions.

When the model encounters a request for something it rarely saw during training—glass shattering, precise mechanical interactions, subtle state changes—it generates plausible-looking but physically incorrect results.

Specific Physics Failures:

1. Brittle Materials:

Sora struggled to generate convincing glass shattering or breaking effects. When requested to generate glass breaking:

Fracture patterns looked unrealistic
Fragment trajectories violated physics
Impact dynamics weren't modeled

The problem: glass breaking appears less frequently in training videos than, say, people walking. The model learned weaker representations of shattering mechanics.

2. State Changes:

Objects often didn't realistically change state:

Eating Food: When eating was requested, the food didn't decrease proportionally, chewing mechanics weren't rendered, and the mouth didn't show food inside. The model understood "eating" as a movement pattern but not its physical consequence (the food disappearing).
Drinking Liquids: Liquid levels didn't decrease realistically; cups didn't become empty
Clothing Changes: Clothing didn't deform or change appearance appropriately (becoming wet, wrinkled, or torn)

3. Complex Mechanics:

Interactions involving multiple components failed:

Gears and Pulleys: Mechanical systems often rotated incorrectly or didn't maintain gear ratios
Hinges and Joints: Rigid body constraints were violated; elbows bent unnaturally
Chains and Ropes: Chains didn't hang with realistic tension; ropes didn't coil properly

4. Fluid Dynamics:

While basic fluid motion (water pouring, splashing) worked, complex scenarios failed:

Pour into Container: Water sometimes didn't properly fill containers; surface waves were unrealistic
Viscous Fluids: Honey, tar, and other thick liquids moved like water
Particle Systems: Sand, snow, and other particle systems exhibited unrealistic behavior

5. Object Permanence in Complex Scenes:

In crowded scenarios with multiple similar objects:

Objects "merged" together
Individual items lost their identity
Objects spontaneously multiplied or disappeared
The model "forgot" what objects should be present

Why These Specific Failures Occur:

Training Data Bias: Video training data isn't uniform. Certain scenarios appear more frequently:

Common: People walking, objects falling, basic collisions
Uncommon: Glass breaking, mechanical interactions, state changes
Rare: Complex fluid dynamics, precise mechanical systems

The model learns better representations for common scenarios.

Temporal Limitations: Sora struggled most with scenarios requiring precise temporal coherence:

Over 20-30 seconds, physics degraded
Accumulated generation errors compounded
Physical laws "drifted" from realism

This suggests the model's internal representation of physics is unstable over long sequences.
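
The compounding effect can be illustrated with a toy rollout. The sketch below is not a description of how Sora works internally; it assumes a hypothetical frame-by-frame predictor whose estimate of gravity is off by 2% (the `BIASED_G` value is an illustrative assumption) and compares its falling-object trajectory with the exact one at 24 frames per second.

```python
TRUE_G = 9.81
BIASED_G = TRUE_G * 1.02  # assumption: the predictor's per-frame update is 2% off


def rollout(gravity, seconds, fps=24):
    """Step a falling object frame by frame and return its positions in metres."""
    dt = 1.0 / fps
    y, v = 0.0, 0.0
    positions = []
    for _ in range(int(seconds * fps)):
        v += gravity * dt  # velocity update from the (possibly biased) gravity
        y += v * dt        # position update from the new velocity
        positions.append(y)
    return positions


def drift(seconds):
    """Absolute position error at the final frame of a clip of the given length."""
    exact = rollout(TRUE_G, seconds)
    approx = rollout(BIASED_G, seconds)
    return abs(approx[-1] - exact[-1])


for seconds in (1, 5, 30):
    print(f"{seconds:>2d} s clip: drift = {drift(seconds):.2f} m")
```

Even though the per-frame error stays the same size, the gap between the two trajectories keeps widening as the clip gets longer, which is the kind of drift described above.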

Video generation is part of a larger shift toward multimodal AI models. Storing and searching video embeddings efficiently requires vector database infrastructure like Zilliz Cloud. Organizations can also deploy Milvus as an open-source option.

Precision Requirements: Some physics require extreme precision:

Glass fracture needs accurate modeling of stress patterns
Mechanical systems need exact component coordination
Fluid dynamics need accurate pressure calculations

Sora's learned representations couldn't achieve this precision. The model generated outputs that looked reasonable rather than outputs that were physically accurate.

Hand-Object Interaction:

While not purely physics, hand-object precision is related:

Hands sometimes didn't properly contact objects
Fine manipulation (threading needles, typing) was inaccurate
Finger positioning violated anatomical constraints

Comparison to Game Engines:

Video game physics engines (Unreal, Unity) achieve consistent, predictable physics because they apply explicit mathematical rules:

force = mass × acceleration
torque = moment_of_inertia × angular_acceleration

Every simulated object follows these equations at every time step, up to numerical integration error.
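
As a rough illustration of rule-based simulation, here is a minimal sketch of a semi-implicit Euler step that applies force = mass × acceleration. It is not taken from Unreal or Unity; the function name `step` and the constants are illustrative assumptions.

```python
def step(position, velocity, force, mass, dt):
    """Advance one body by one time step using semi-implicit Euler."""
    acceleration = force / mass  # force = mass × acceleration, rearranged
    velocity = velocity + acceleration * dt
    position = position + velocity * dt
    return position, velocity


# Drop a 2 kg ball from rest for one second at 60 steps per second.
mass, gravity, dt = 2.0, -9.81, 1.0 / 60.0
position, velocity = 0.0, 0.0
for _ in range(60):
    position, velocity = step(position, velocity, mass * gravity, mass, dt)

print(position, velocity)
```

The final position differs slightly from the analytic value of -4.905 m because time is discretized, but the same rule is applied on every step regardless of which object is being simulated.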

Sora's learned approach is different:

If input pattern ≈ training_scenario:
 generate_statistically_likely_output()
Else:
 ¯\_(ツ)_/¯
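
The contrast can be made concrete with a toy stand-in for a learned model. The sketch below is an illustration under loud assumptions, not Sora's architecture: a nearest-neighbour lookup (the hypothetical `learned_fall_distance`) memorises fall distances for the short drops it has seen and is then queried outside that range.

```python
G = 9.81


def true_fall_distance(t):
    """Exact distance fallen from rest after t seconds (no air resistance)."""
    return 0.5 * G * t * t


# "Training data": only short drops between 0 and 1 second were observed.
train_times = [i / 10 for i in range(11)]
train_set = [(t, true_fall_distance(t)) for t in train_times]


def learned_fall_distance(t):
    """Toy learned model: return the answer stored for the closest training time."""
    _, nearest_distance = min(train_set, key=lambda pair: abs(pair[0] - t))
    return nearest_distance


for t in (0.5, 3.0):
    print(t, true_fall_distance(t), learned_fall_distance(t))
```

Inside the training range the lookup matches the exact formula. At 3 seconds it simply returns the value for the closest drop it has seen (1 second), which looks like a plausible fall distance but is far from the true one.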

Implications for Production:

What Sora Could Handle:

People moving naturally
Objects falling, bouncing, colliding
Basic physics interactions (stacking, sliding)
Scene and world consistency within shorter clips

What Required Workarounds:

Complex mechanical systems → Use game engine physics, then composite with Sora footage
Precise state changes → Use editing to show before/after rather than transition
Glass breaking → Pre-render with a proper physics engine, or describe the break explicitly in the prompt to get the closest approximation
Precise hand-object interaction → Use multiple generations, select best
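
The "multiple generations, select best" workaround amounts to best-of-N sampling. A minimal sketch follows; `generate` and `score` are hypothetical callables supplied by the caller (for example, a wrapper around a video generation request and a manual or automated quality rating) and are not part of any real Sora API.

```python
import random
from typing import Callable, Tuple, TypeVar

Clip = TypeVar("Clip")


def best_of_n(generate: Callable[[int], Clip],
              score: Callable[[Clip], float],
              n: int) -> Tuple[Clip, float]:
    """Generate n candidates (one per seed) and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(seed) for seed in range(n)]
    scored = [(clip, score(clip)) for clip in candidates]
    return max(scored, key=lambda pair: pair[1])


# Stand-in generator and scorer so the sketch runs without any video model.
def fake_generate(seed: int) -> dict:
    rng = random.Random(seed)
    return {"seed": seed, "hand_contact_quality": rng.random()}


def fake_score(clip: dict) -> float:
    return clip["hand_contact_quality"]


best_clip, best_score = best_of_n(fake_generate, fake_score, n=8)
print(best_clip["seed"], round(best_score, 3))
```

In practice the scoring step is the hard part: it might be a human reviewer or a separate quality check, and the cost grows linearly with the number of generations.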

The Fundamental Limitation:

Physics learned purely from data tends to struggle with scenarios outside the training distribution. As long as video models rely on learned representations rather than explicit physics rules, certain scenarios are likely to remain problematic.

Future Direction:

Physics-informed neural networks (PINNs) incorporate explicit physical constraints into learned models, which could potentially reduce these limitations. However, this remains an open research problem.
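
The core idea of a physics-informed loss can be sketched without a neural network: add a penalty for violating a known equation to the usual data-fitting error. The example below assumes the constraint is free fall (the second derivative of height equals -g) and estimates it with finite differences. The function names and the `physics_weight` parameter are illustrative assumptions, and real PINNs typically compute the residual with automatic differentiation rather than finite differences.

```python
G = 9.81


def physics_residual(heights, dt):
    """Mean squared violation of y'' = -G, estimated by central differences."""
    residuals = []
    for i in range(1, len(heights) - 1):
        accel = (heights[i + 1] - 2 * heights[i] + heights[i - 1]) / (dt * dt)
        residuals.append((accel + G) ** 2)
    return sum(residuals) / len(residuals)


def data_error(predicted, observed):
    """Mean squared error between predicted and observed heights."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def physics_informed_loss(predicted, observed, dt, physics_weight=0.1):
    """Data-fitting error plus a weighted penalty for breaking the physics."""
    return data_error(predicted, observed) + physics_weight * physics_residual(predicted, dt)


dt = 0.1
times = [i * dt for i in range(11)]
free_fall = [10.0 - 0.5 * G * t * t for t in times]  # obeys y'' = -G
constant_speed = [10.0 - 4.905 * t for t in times]   # ignores acceleration

print(physics_residual(free_fall, dt), physics_residual(constant_speed, dt))
```

A trajectory that obeys the constraint has a residual near zero, while one that merely looks like a falling object is penalised, which gives training a signal that plain data fitting lacks.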
