While Sora 2 represented a significant improvement in physics simulation, the model struggled with certain object interactions and physical phenomena:
Why Sora Failed at Some Physics:
Learned vs. Programmed Physics:
Sora didn't encode physics as explicit mathematical rules (like video game engines do). Instead, it learned physics statistically from training data. This approach works well for common scenarios frequently depicted in videos but fails for rare or complex interactions.
When the model encountered a request for something it rarely saw during training (glass shattering, precise mechanical interactions, subtle state changes), it generated plausible-looking but physically incorrect results.
Specific Physics Failures:
1. Brittle Materials:
Sora struggled to generate convincing glass shattering or breaking effects. When requested to generate glass breaking:
- Fracture patterns looked unrealistic
- Fragment trajectories violated physics
- Impact dynamics weren't modeled
The problem: glass breaking appears less frequently in training videos than, say, people walking. The model learned weaker representations of shattering mechanics.
2. State Changes:
Objects often didn't realistically change state:
- Eating Food: When eating was requested, the food didn't decrease proportionally, chewing mechanics weren't rendered, and the mouth didn't show food inside. The model understood "eating" as a movement pattern but not its physical consequence (the food disappearing).
- Drinking Liquids: Liquid levels didn't decrease realistically; cups didn't become empty
- Clothing Changes: Clothing didn't deform or change appearance appropriately when it should have become wet, wrinkled, or torn
3. Complex Mechanics:
Interactions involving multiple components failed:
- Gears and Pulleys: Mechanical systems often rotated incorrectly or didn't maintain gear ratios
- Hinges and Joints: Rigid body constraints were violated; elbows bent unnaturally
- Chains and Ropes: Chains didn't hang with realistic tension; ropes didn't coil properly
4. Fluid Dynamics:
While basic fluid motion (water pouring, splashing) worked, complex scenarios failed:
- Pour into Container: Water sometimes didn't properly fill containers; surface waves were unrealistic
- Viscous Fluids: Honey, tar, and other thick liquids moved like water
- Particle Systems: Sand, snow, and other particle systems exhibited unrealistic behavior
5. Object Permanence in Complex Scenes:
In crowded scenarios with multiple similar objects:
- Objects "merged" together
- Individual items lost their identity
- Objects spontaneously multiplied or disappeared
- The model "forgot" what objects should be present
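One way to catch the object-permanence failures listed above is to compare per-frame object counts and flag every change for review. The sketch below assumes the counts already exist (for example, from an object detector, which is not shown). The function name `find_count_jumps` and the input format are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple


def find_count_jumps(
    counts_per_frame: List[Dict[str, int]],
) -> List[Tuple[int, str, int, int]]:
    """Return (frame_index, label, previous_count, new_count) for every change.

    counts_per_frame is assumed to come from an external object detector;
    each entry maps an object label to the number of detected instances.
    """
    jumps = []
    for i in range(1, len(counts_per_frame)):
        prev, curr = counts_per_frame[i - 1], counts_per_frame[i]
        for label in sorted(set(prev) | set(curr)):
            before, after = prev.get(label, 0), curr.get(label, 0)
            if before != after:
                jumps.append((i, label, before, after))
    return jumps


# Hypothetical detector output for a four-frame clip of apples on a table.
frames = [{"apple": 5}, {"apple": 5}, {"apple": 4}, {"apple": 6}]
for frame_index, label, before, after in find_count_jumps(frames):
    print(f"frame {frame_index}: {label} count changed {before} -> {after}")
```

A count change is not always an error (an object can legitimately leave the frame), so a check like this only narrows down which frames a person should inspect.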
Why These Specific Failures Occur:
Training Data Bias: Video training data isn't uniform. Certain scenarios appear more frequently:
- Common: People walking, objects falling, basic collisions
- Uncommon: Glass breaking, mechanical interactions, state changes
- Rare: Complex fluid dynamics, precise mechanical systems
The model learns better representations for common scenarios.
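The effect of uneven training coverage can be illustrated with a toy model. The sketch below is not how Sora works: it uses a nearest-neighbor predictor on a one-dimensional function as a stand-in, with 50 training examples in a "common" region and only 3 in a "rare" region. All names and sample counts are assumptions made for the illustration.

```python
import math


def true_physics(x: float) -> float:
    """Stand-in for the real behavior the model should reproduce."""
    return math.sin(2 * math.pi * x)


def nearest_neighbor_predict(train, x):
    """Predict by copying the output of the closest training example."""
    _, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_y


def mean_abs_error(train, xs):
    """Average absolute prediction error over the test inputs xs."""
    errors = [abs(nearest_neighbor_predict(train, x) - true_physics(x)) for x in xs]
    return sum(errors) / len(errors)


# "Common" scenarios: 50 training examples covering [0, 1).
common_train = [(i / 50, true_physics(i / 50)) for i in range(50)]
# "Rare" scenarios: only 3 training examples covering [1, 2).
rare_train = [(1 + j / 3, true_physics(1 + j / 3)) for j in range(3)]
train = common_train + rare_train

common_test = [(i + 0.5) / 100 for i in range(100)]
rare_test = [1 + (i + 0.5) / 100 for i in range(100)]

common_error = mean_abs_error(train, common_test)
rare_error = mean_abs_error(train, rare_test)
print(f"common-region error: {common_error:.3f}")
print(f"rare-region error:   {rare_error:.3f}")
```

The underlying function has the same shape in both regions, so the gap in error comes only from how densely each region was sampled.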
Temporal Limitations: Sora struggled most with scenarios requiring precise temporal coherence:
- Over 20-30 seconds, physics degraded
- Accumulated generation errors compounded
- Physical laws "drifted" from realism
This suggests the model's internal representation of physics is unstable over long sequences.
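A rough way to see why small errors matter over long clips is to compound a fixed per-frame error. The numbers below (0.1% error per frame, 24 frames per second) are assumptions picked for illustration, not measured properties of Sora, and the multiplicative error model is a deliberate simplification.

```python
def accumulated_drift(per_frame_error: float, fps: int, seconds: float) -> float:
    """Relative drift after compounding a fixed per-frame error.

    Assumes each frame scales the existing deviation by (1 + per_frame_error),
    a simplification chosen for illustration.
    """
    frames = int(fps * seconds)
    return (1 + per_frame_error) ** frames - 1


for seconds in (5, 10, 20, 30):
    drift = accumulated_drift(per_frame_error=0.001, fps=24, seconds=seconds)
    print(f"{seconds:>2} s: {drift:.1%} accumulated drift")
```

Under these assumptions the drift grows faster than linearly with clip length, which is consistent with physics looking acceptable in short clips and degrading in longer ones.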
Precision Requirements: Some physics require extreme precision:
- Glass fracture needs accurate modeling of stress patterns
- Mechanical systems need exact component coordination
- Fluid dynamics need accurate pressure calculations
Sora's learned representations couldn't achieve this precision: the model generated outputs that look reasonable, not outputs that are physically accurate.
Hand-Object Interaction:
While not purely physics, hand-object precision is related:
- Hands sometimes didn't properly contact objects
- Fine manipulation (threading needles, typing) was inaccurate
- Finger positioning violated anatomical constraints
Comparison to Game Engines:
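Video game physics engines (such as those in Unreal and Unity) produce physically consistent motion because they apply explicit mathematical rules:
force = mass × acceleration
torque = moment_of_inertia × angular_acceleration
Every simulated object follows these equations at each time step, up to the accuracy of the engine's numerical integration.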
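The rule-based approach can be sketched as a single update step that applies the two equations above. This is a minimal one-dimensional illustration using semi-implicit Euler integration, not the actual code of Unreal or Unity. The `Body` class and `step` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Body:
    mass: float                    # kg
    moment_of_inertia: float       # kg·m²
    position: float = 0.0          # m (one dimension for brevity)
    velocity: float = 0.0          # m/s
    angle: float = 0.0             # rad
    angular_velocity: float = 0.0  # rad/s


def step(body: Body, force: float, torque: float, dt: float) -> None:
    """Advance one time step with semi-implicit Euler integration."""
    # Rearranged from force = mass × acceleration.
    acceleration = force / body.mass
    # Rearranged from torque = moment_of_inertia × angular_acceleration.
    angular_acceleration = torque / body.moment_of_inertia
    # Update velocities first, then positions using the new velocities.
    body.velocity += acceleration * dt
    body.position += body.velocity * dt
    body.angular_velocity += angular_acceleration * dt
    body.angle += body.angular_velocity * dt


ball = Body(mass=2.0, moment_of_inertia=0.5)
gravity_force = ball.mass * -9.81
for _ in range(60):  # one simulated second at 60 steps per second
    step(ball, force=gravity_force, torque=0.0, dt=1 / 60)
print(f"velocity after 1 s: {ball.velocity:.2f} m/s")
print(f"position after 1 s: {ball.position:.3f} m")
```

The same rule is applied to every object at every step, so rare scenarios are handled no differently from common ones. The remaining inaccuracy comes from the finite step size, not from missing training examples.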
Sora's learned approach is different:
if input_pattern ≈ training_scenario:
    generate_statistically_likely_output()
else:
    ¯\_(ツ)_/¯
Implications for Production:
What Sora Could Handle:
- People moving naturally
- Objects falling, bouncing, colliding
- Basic physics interactions (stacking, sliding)
- Long-range world consistency in simple scenes (physics still degraded over longer sequences, as noted above)
What Required Workarounds:
- Complex mechanical systems → Use game engine physics, then composite with Sora footage
- Precise state changes → Use editing to show before/after rather than transition
- Glass breaking → Pre-render with a proper physics engine, or describe the break explicitly in the prompt to get the closest approximation
- Precise hand-object interaction → Use multiple generations, select best
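The "multiple generations, select best" workaround can be sketched as a simple best-of-N loop. The generation and scoring functions here are placeholders: `fake_generate` and `fake_score` are hypothetical stand-ins so the example runs, and no real video generation API is being called or implied.

```python
import random
from typing import Callable, List, Tuple, TypeVar

Clip = TypeVar("Clip")


def select_best(
    generate: Callable[[int], Clip],
    score: Callable[[Clip], float],
    attempts: int,
) -> Tuple[Clip, float]:
    """Generate one candidate per seed and keep the highest-scoring one."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    candidates: List[Tuple[Clip, float]] = []
    for seed in range(attempts):
        clip = generate(seed)
        candidates.append((clip, score(clip)))
    return max(candidates, key=lambda pair: pair[1])


# Stand-ins so the sketch runs; replace with a real generation call and a
# real quality check (human review or an automated metric).
def fake_generate(seed: int) -> dict:
    return {"seed": seed, "quality": random.Random(seed).random()}


def fake_score(clip: dict) -> float:
    return clip["quality"]


best_clip, best_score = select_best(fake_generate, fake_score, attempts=8)
print(f"kept seed {best_clip['seed']} with score {best_score:.2f}")
```

In practice the scoring step is the hard part. For hand-object interaction it may have to be a person reviewing each candidate.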
The Fundamental Limitation:
Physics learned from data will always struggle with scenarios outside the training distribution. As long as video models rely on learned representations rather than explicit physics rules, certain scenarios will remain problematic.
Future Direction:
Research into physics-informed neural networks (PINNs) aims to incorporate explicit physical constraints into learned models, which could address these limitations. However, this remains an open research problem.
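The core idea behind physics-informed training is to add a penalty for violating a known physical law to the usual data-fitting loss. The sketch below illustrates that combined loss for free fall (y'' = -g) using finite differences on a predicted trajectory. It is a toy under stated assumptions: there is no neural network or training loop, and the function names are hypothetical. Actual PINNs typically obtain the derivatives by automatic differentiation of the network and minimize the combined loss by gradient descent.

```python
from typing import List

G = 9.81  # gravitational acceleration, m/s²


def data_loss(predicted: List[float], observed: List[float]) -> float:
    """Mean squared error against observed heights."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


def physics_residual_loss(predicted: List[float], dt: float) -> float:
    """Mean squared violation of y'' = -G, using central finite differences."""
    residuals = [
        (predicted[i - 1] - 2 * predicted[i] + predicted[i + 1]) / dt**2 + G
        for i in range(1, len(predicted) - 1)
    ]
    return sum(r**2 for r in residuals) / len(residuals)


def total_loss(predicted, observed, dt, physics_weight=1.0):
    """Data-fitting term plus a weighted physics-violation term."""
    return data_loss(predicted, observed) + physics_weight * physics_residual_loss(predicted, dt)


dt = 0.1
times = [i * dt for i in range(11)]
free_fall = [10.0 - 0.5 * G * t**2 for t in times]  # obeys y'' = -G
linear_drop = [10.0 - 4.905 * t for t in times]     # same endpoints, violates the law

print(f"physics loss, free fall:   {physics_residual_loss(free_fall, dt):.6f}")
print(f"physics loss, linear drop: {physics_residual_loss(linear_drop, dt):.6f}")
```

Both trajectories start and end at the same heights, but only the linear one is penalized by the physics term. That is the kind of signal a purely data-driven loss does not provide on its own.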
