Inferring movie plot from visuals using GPT4-V

Charlie, too, is afraid of what is coming.

A few days ago, OpenAI added a preview version of GPT4-V to their API, allowing images to be passed to the chat endpoint. After reading about the David Attenborough-narration proof of concept, I decided to test this functionality by mocking up a movie summarizer that uses exclusively the video component of the movie. For simplicity, I chose the famous Charlie Chaplin movie “Modern Times” because it is in the public domain and its plot ought to be easy to follow. As an added irony, its content is more applicable than ever before.

Conceptually, the key question is how to preserve the important elements contained in the individual frames and link them together. While the endpoint can be fed a text prompt and multiple frames simultaneously, a movie contains too many frames for a single request, even when only a subset is considered. To ensure continuity, I batched a number of frames together and prompted the model with a summary of the previous batch of frames, a request for a frame-by-frame summary of persons and their actions, and finally a request for an overall summary including both.

Schematic of information flow from one batch to the next

Sampling frames from the movie

Before processing, equidistant frames were sampled from the movie using OpenCV and encoded as JPEG. A frame interval of 12.5 s seemed a sufficient balance between capturing the general plot and minimizing the number of API calls.

import cv2
frame_distance = 12.5 # in seconds
video_filename = "Modern Times (1936).mp4"

# Open the video
cap = cv2.VideoCapture(video_filename)

# Check if video opened successfully
if not cap.isOpened():
    print("Error opening video file")

# Get the frame rate of the video
fps = cap.get(cv2.CAP_PROP_FPS)

# Calculate the interval in frames (N seconds)
frame_interval = int(round(fps * frame_distance))

frame_number = 0
frames = []
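while cap.isOpened():
    # Capture frame-by-frame
    ret, frame = cap.read()

    if not ret:
        # No more frames (or a read error): stop instead of looping forever
        break

    # Check if the current frame number is at the N-second interval
    if frame_number % frame_interval == 0:
        # Encode the frame as JPEG and keep the raw bytes
        _, buffer = cv2.imencode(".jpg", frame)
        frames.append(buffer.tobytes())
    frame_number += 1

# When everything is done, release the video capture object
cap.release()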
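
To get a feeling for how the frame interval translates into API calls, the arithmetic can be sketched as follows. This is only a rough estimate under stated assumptions: the helper name `estimate_api_calls` is hypothetical, and the one-hour duration, the frame rate of 24 fps, and the batch size of 10 frames are example values rather than measured properties of the file.

```python
import math

def estimate_api_calls(duration_s, fps, frame_distance_s=12.5,
                       batch_size=10, skipped_frames=0):
    """Estimate the number of sampled frames and API calls (hypothetical helper)."""
    # Same rounding as in the sampling code: interval expressed in frames
    frame_interval = int(round(fps * frame_distance_s))
    total_frames = int(duration_s * fps)
    # Frames 0, interval, 2*interval, ... below total_frames are sampled
    sampled = math.ceil(total_frames / frame_interval)
    # Optionally ignore some leading frames (e.g. opening credits)
    remaining = max(sampled - skipped_frames, 0)
    calls = math.ceil(remaining / batch_size)
    return sampled, calls

# Example values (assumptions): one hour of video at 24 fps
sampled, calls = estimate_api_calls(duration_s=3600, fps=24)
print(sampled, calls)
```

Halving the frame distance roughly doubles both numbers, which is the trade-off between plot coverage and the number of API calls.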

In a Jupyter notebook, one can quickly flip through the frames to check whether the frame interval is appropriate for gleaning the plot from the selected frames.

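from IPython.display import Image, display
import time

display_handle = display(None, display_id=True)
for i, img in enumerate(frames[10:]):
    imgi = Image(data=img)
    # Replace the displayed image in place, then pause briefly
    display_handle.update(imgi)
    time.sleep(0.1)  # pause length is an arbitrary choice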
Generating a prompt

As for prompting, it proved successful to split the prompt into two parts:

  • one that explicitly asks for a frame-by-frame interpretation and lists the key points to pay attention to in order to understand a movie, and
  • one that asks for a summary of the action, given the summary of the plot up to the present point.

This strategy is a bit awkward, but it is currently required because the chat endpoint does not yet support GPT4-V for expanding upon a prior context.

Between the frame-by-frame description and the summary, a marker is requested for easy processing later on.

import base64

def make_prompt(summary_so_far, frames):
    prompt_text = """
    describe the following, successive frames from a movie in less than 50 words each by listing the people( refer to them by notable physical features), denote their actions expressions,
    try to infer the actions, important objects, locations and relationships. try to infer context from the historical and cultural era.
    """
    # Carry over the plot summary from the previous batch, if there is one
    if summary_so_far:
        prompt_text += 'so far in the movie, ' + summary_so_far
    prompt_text += '\nafter that, write MARKER and write max 500 word summary of the events so far (preserving key plot points)'
    # Each frame is sent as a base64-encoded JPEG
    prompt_images = [{'type': 'image', 'image': base64.b64encode(x).decode('utf-8'), 'resize': 768} for x in frames]
    return [{
        'role': 'user',
        'content': [
            { 'type': 'text', 'text': prompt_text.strip()},
            *prompt_images,
        ],
    }]
Trying out GPT4-V

To try different prompting strategies, I used the chat completions endpoint and a relatively low token limit of 500. Especially when processing more frames, the token limit needs to be increased to prevent the response from being cut off prematurely.

I tried out the prompt on a set of four frames that should be straightforward to understand:

from openai import OpenAI

client = OpenAI()

params = {
    "model": "gpt-4-vision-preview",
    "messages": make_prompt(False, frames[28:32]),
    "max_tokens": 500,
}
result = client.chat.completions.create(**params)
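
Whether a response was cut off by the token limit can usually be read from the `finish_reason` of the returned choice, which is `"length"` when the limit was hit. The sketch below retries with a larger limit in that case. It is an illustration only: the helper name and the list of limits are hypothetical, `create` stands for the chat completions call of the client, and whether the preview model fills `finish_reason` reliably is an assumption that should be checked against an actual response. A stand-in function replaces the API call so that the sketch runs offline.

```python
from types import SimpleNamespace

def complete_with_growing_limit(create, params, limits=(500, 750, 1000)):
    """Retry a request with larger token limits while the output is truncated."""
    response = None
    for limit in limits:
        response = create(**{**params, "max_tokens": limit})
        # "length" signals that the response stopped at the token limit
        if response.choices[0].finish_reason != "length":
            break
    return response

# Stand-in for the real API call (assumption, for offline demonstration only)
attempted_limits = []

def fake_create(**kwargs):
    attempted_limits.append(kwargs["max_tokens"])
    reason = "length" if kwargs["max_tokens"] < 750 else "stop"
    return SimpleNamespace(choices=[SimpleNamespace(finish_reason=reason)])

response = complete_with_growing_limit(
    fake_create, {"model": "gpt-4-vision-preview", "messages": []}
)
print(attempted_limits, response.choices[0].finish_reason)
```

Each retry resends all frames of the batch, so starting with a generous limit may be cheaper than retrying often.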

The per-frame output is shown next to the frames below:

Frame Description
Frame 1: A person with suspenders is seen looking up at a large screen displaying a man’s face, presumably an authority figure, who seems to be speaking. The environment is sterile and industrial, suggesting a controlled, possibly dystopian setting.
Frame 2: The same person with suspenders is rushing past a punch clock station, looking back as if hurried or late. The presence of a punch clock indicates a structured, time-regulated work environment.
Frame 3: The person with suspenders is working alongside a bald man at a conveyor belt, both dressed in work overalls. They are focused on the task at hand, indicating a factory setting with repetitive tasks.
Frame 4: The person with suspenders is observed attempting to keep pace with the conveyor belt process while the bald man is intently working. The factory setting seems to demand efficiency, suggesting a theme of industrialization and perhaps worker exploitation.

It appears that the film is set in the era of industrialization, perhaps in the early to mid-20th century, depicting a society that values efficiency and conformity over individuality. The stark and minimalistic design of the workplace hints at a dystopian reality where workers are closely monitored and controlled.

The storyline so far seems to follow our main character, identifiable by suspenders, who is navigating the oppressive environment of a factory. The large portrait on the screen in the first frame suggests a figure of authority—perhaps a factory owner or an overseer—emphasizing the theme of surveillance and control. This imposing figure's presence looms over the workers, creating an atmosphere of intimidation and possibly fear.

In the second frame, the worker's rush and glance back at the punch clock conveys a sense of urgency, underscoring the strict regimen imposed upon the workers. Being late or out of sync with the schedule is likely frowned upon or penalized, hinting at the rigid and demanding nature of their work-life.

The third and fourth frames depict a typical day on the job for the workers, who are engaged in monotonous and physically demanding labor. The sense of camaraderie or solidarity between the person with suspenders and the bald man is uncertain; however, their shared circumstance suggests they are both cogs in the vast machinery of the industrial complex. The focus on the production line and the laborer's interaction with it accentuates the dehumanizing effect of the work environment.

The narrative so far seems to cover typical aspects of the industrialization period's workforce—the regimentation of workers, the importance of punctuality and productivity, as well as the hierarchical structure that guides the workers' daily routines. The difference in how the workers are portrayed might hint at their differing stations within this environment or the commencement of a character's arc where they learn the ropes of the job. The overarching context posits a commentary on the industrial era's impacts on individuals and their interactions within the workspace.

The frame-by-frame descriptions closely match what can be seen on-screen and follow the prompt. Identifying Charlie as “the person with suspenders” and establishing continuity across the shots works reliably. Perhaps a bit of tinkering with the length of the summary, or with the ratio between the number of frames to be described and the summary length, will force the model to extract more of the story from the frames rather than permit a full retelling.

Getting the full summary by scanning through all of the frames

To get through all of the frames, we iterate over the frame batches and keep feeding the previous summary into the prompt.

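summaries = []
framei = 10

while framei < len(frames):
    params = {
        "model": "gpt-4-vision-preview",
        "messages": make_prompt(
            # Pass on only the summary part (after the marker) of the previous response
            False if len(summaries) == 0 else summaries[-1].split('MARKER')[-1],
            frames[framei:framei + 10],
        ),
        "max_tokens": 750,
    }

    result = client.chat.completions.create(**params)
    # Keep the full response (frame descriptions, marker and summary)
    summaries.append(result.choices[0].message.content)
    framei += 10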

Shown below is the final summary:

The film follows a mustached everyman wearing a bowler hat, symbolizing the common man's resilience amid the early industrial era's harsh realities. It begins with him enduring the relentless pace of an assembly line, leading to a mental breakdown. Mistaken for a protest ringleader, he is unjustly arrested.

Upon release, the man meets a destitute woman with dark hair. Together, they navigate the Great Depression, presence of unemployment and ever-watchful law enforcement. The duo's bond strengthens as they face numerous comedic setbacks in their quest for stability, including disastrous stints as restaurant waitstaff and stage performers. Their struggles reflect the period's disregard for the individual in favor of industrial progress.

Recently, an unsuccessful performance at a restaurant involved the man humorously faking a singing act. This episode leads to a confrontation with a stout figure representing authority, revealing the ever-present societal tensions. The pair then become entangled in a restaurant disturbance, hinting at a chaotic escape or brawl, further illustrating their ongoing confrontation with oppressive norms.

As their journey progresses, we see a shift to a hillside road where the two characters find a moment of reprieve. Their subsequent actions, including moments of gentle reassurance and shown solidarity, suggests that despite the setbacks, hope persists. An intertitle message emphasizes their determination, reinforcing the film's central themes.

The full summary injects a lot of overinterpretation into the source material, but does not stray too far from the plot, given that only one in every 300 frames was used. Continuity of the action is successfully transferred between batches of frames. Using a larger context window, or batching by visual similarity (i.e. per scene), might prove more successful.
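
Batching by visual similarity was not tried here, but one possible approach can be sketched as follows. The sketch assumes that a normalized color or intensity histogram is available for each sampled frame (it could, for instance, be computed with OpenCV's `cv2.calcHist` on the decoded frame), and it starts a new batch whenever two consecutive histograms differ by more than a threshold. The function name, the distance measure, and the threshold of 0.5 are illustrative assumptions, not tuned values.

```python
def group_by_similarity(histograms, threshold=0.5):
    """Group consecutive frame indices whose histograms are similar.

    Each histogram is a sequence of non-negative values summing to 1.
    Returns a list of batches, each a list of frame indices.
    """
    batches = []
    for i, hist in enumerate(histograms):
        if batches:
            previous = histograms[i - 1]
            # Total variation distance, between 0 (identical) and 1 (disjoint)
            distance = 0.5 * sum(abs(a - b) for a, b in zip(previous, hist))
            if distance <= threshold:
                batches[-1].append(i)
                continue
        # First frame, or a large visual change: start a new batch
        batches.append([i])
    return batches

# Toy histograms with three bins: two dark frames followed by two bright ones
toy = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
print(group_by_similarity(toy))
```

In practice, long scenes would still need to be capped at a maximum batch size so that a single request does not contain too many frames.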

Given how little effort and money it took to evaluate one hour of video material (I spent $2.67 on API calls), there are a lot of amazing and scary things that will be built with this.