Meta's new AI model tags and tracks every object in your videos

Meta has a new AI model that can label and follow any object in a video as it moves around. The Segment Anything Model 2 (SAM 2) extends the capabilities of its predecessor, SAM, which was limited to images, opening up new opportunities for video editing and analysis.

SAM 2’s real-time segmentation is a potentially huge technical leap. It showcases how AI can process moving images and distinguish among the elements on screen even as they move around or out of the frame and back in again.

Segmentation is the term for how software determines which pixels in an image belong to which objects. An AI assistant that can do so makes it a lot easier to process or edit complicated images. That was the breakthrough of Meta’s original SAM. SAM has helped segment sonar images of coral reefs, parsed satellite images to aid disaster relief efforts, and even analyzed cellular images to detect skin cancer.
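To make that definition concrete, here is a minimal sketch of what a segmentation result can look like as data. It is not SAM's code or output format; the tiny 4x6 "image", the object IDs, and the helper names `binary_mask` and `pixel_counts` are all invented for illustration. It assumes only that each pixel is assigned the ID of the object it belongs to.

```python
from collections import Counter

# Hypothetical 4x6 label map: each entry is the object ID assigned to that pixel.
# 0 = background, 1 = first object, 2 = second object (illustrative values only).
LABELS = [
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
]


def binary_mask(labels, object_id):
    """Return a True/False grid marking the pixels that belong to object_id."""
    return [[pixel == object_id for pixel in row] for row in labels]


def pixel_counts(labels):
    """Count how many pixels each object ID covers."""
    return Counter(pixel for row in labels for pixel in row)


mask = binary_mask(LABELS, 1)   # per-pixel mask for object 1
counts = pixel_counts(LABELS)   # pixels covered by each object ID
```

With a mask like this in hand, an editor can apply a change (recolor, blur, cut out) to just the pixels where the mask is true, which is why a good segmentation makes complicated images so much easier to work with.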

SAM 2 extends that capability to video, which is no small feat and would not have been feasible until very recently. As part of SAM 2’s debut, Meta shared a dataset of 50,000 videos created to train the model. That’s on top of the 100,000 other videos Meta mentioned employing. Between all that training data and the significant computing power that real-time video segmentation requires, SAM 2 may be open and free at the moment, but it likely won’t stay that way forever.

Meta SAM 2

(Image credit: Meta)

Segment Success

Using SAM 2, video editors could isolate and manipulate objects within a scene more easily than current editing software allows, and far faster than adjusting each frame by hand. Meta envisions SAM 2 revolutionizing interactive video, too. Users could select and manipulate objects within live videos or virtual spaces thanks to the AI model.

Meta thinks SAM 2 could also play a crucial role in the development and training of computer vision systems, particularly in autonomous vehicles. Accurate and efficient object tracking is essential for these systems to interpret and navigate their environments safely. SAM 2’s capabilities could speed up the annotation of visual data, providing high-quality training data for these AI systems.

A lot of the AI video hype is around generating videos from text prompts. Models like OpenAI’s Sora, Runway, and Google Veo get a lot of attention for a reason. Still, the kind of editing ability provided by SAM 2 might play an even bigger role in embedding AI in video creation.

And, while Meta might have an edge now, other AI video developers are keen on producing their own versions. For instance, Google’s recent research has led to video summarization and object recognition features that it is testing on YouTube. Adobe’s Firefly AI tools are also centered on photo and video editing and include content-aware fill and auto-reframe features.