Automatic TV Show Intro Detection

Posted: July 23, 2025

Last modified: July 24, 2025

Tags: python computer-vision intro-detection

TLDR

The code used for this post can be found in this GitHub repo at this commit.

Using MS-SSIM (Multi-Scale Structural Similarity) to measure the similarity between a known reference frame and the last frame of the intro in DBZ Kai allowed me to figure out where the intro sequence was in a subset of the pre-Boo Saga DBZ Kai episodes.

Single-core performance is about 33 fps, and using 4 cores to parallelize the process provides a small increase to about 52 fps.

Using a visual-based process to find intros seems to be unreliable for a series that has multiple different intro visuals while keeping, more or less, the same audio track.

Motivation

Recently, I was re-watching Dragon Ball Z Kai and noticed that the intro song, Dragon Soul, had several different singers throughout various parts of the show. I didn’t recall this from my childhood and only remember one specific singer, Vic Mignogna. Since I already had all the episodes downloaded, I was thinking about replacing the audio track of the intro from each episode with the audio track which had Vic singing the theme song.

To do this, I’ll use ffmpeg to perform the actual audio replacement, but I need to specify the start of the intro for each episode. Not all episodes start rolling the intro at the same time so I’ll need to detect the intro for each episode on its own.

Methodology

The basic concept behind both intro detection methods is the same: take a reference frame from either the very beginning or the very end of the intro, then go through every frame in each episode and compute its similarity to the reference frame. I’ll be using MS-SSIM (Multi-Scale Structural Similarity) to calculate frame similarity. A sufficiently high level of similarity is a tell that the program has found a match, and we can use the index of the matched frame and the FPS to calculate the timestamp where the match happens.
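The frame-index-to-timestamp step can be sketched as follows. This is a minimal illustration, not the code from the repo: the function name `frame_to_timestamp` is hypothetical, and it assumes a constant frame rate and truncates partial seconds.

```python
def frame_to_timestamp(frame_index: int, fps: float) -> str:
    """Convert a frame index to an MM:SS timestamp, truncating partial seconds."""
    total_seconds = int(frame_index / fps)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"


# Frame indices and FPS taken from the results tables in this post.
print(frame_to_timestamp(624, 23.98))   # 00:26
print(frame_to_timestamp(2664, 23.98))  # 01:51
```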

First Frame Based Intro Detection

For this method, the program detects the first frame of the intro. For the DBZ Kai pre-Boo Saga episodes specifically, this poses a slight problem: the first frame of the intro is completely black, and there is about a second or two of black frames with no audio before the first visual frame appears. So instead, I used the first frame that wasn’t just black to calculate the start and end positions. Because of that slight offset, the replaced audio was not completely in sync.

Last Frame Based Intro Detection

For this method the program will loop through the frames until there’s a large drop-off in frame similarity. The idea is that for some frames at the end of the intro, there will be very high similarity between the reference frame and the frame the program is currently on. The frame directly after is probably not going to be similar at all to the end of the intro so there will be a steep drop-off of similarity. Using this drop-off we can figure out when the intro starts and ends more reliably.

Figure: MS-SSIM drop-off

High-Level Implementation

First Frame Based Intro Detection

A pretty simple process.

1. Read a frame from the cv2 VideoCapture and resize it if necessary.
2. Calculate the MS-SSIM score against the reference frame and add it to a list of scores.
3. If the current score is greater than 0.8, we’ve found the intro.
4. Using the FPS, calculate the timestamp in seconds and then turn that into an MM:SS time for the first frame of the intro.
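The threshold check in the steps above can be sketched on its own, separate from the frame reading and scoring. The function name `first_match_index` and the example scores are made up for illustration; only the 0.8 threshold comes from this post.

```python
from typing import Iterable, Optional


def first_match_index(scores: Iterable[float], threshold: float = 0.8) -> Optional[int]:
    """Return the index of the first score strictly above threshold, or None."""
    for index, score in enumerate(scores):
        if score > threshold:
            return index
    return None


# Made-up per-frame scores: the first match is at index 3.
example_scores = [0.12, 0.10, 0.35, 0.93, 0.95]
print(first_match_index(example_scores))  # 3
```

Because it accepts any iterable, the same function would also work on a generator that yields scores as frames are read, so the loop can stop at the first match without scoring the rest of the episode.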

Last Frame Based Intro Detection

Pretty similar to the First Frame based process except step 3 would be:

If the score drops by more than 0.5 from the previous frame to the current one, we’ve found the last frame of the intro.
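A minimal sketch of that drop-off check, assuming the scores are already in a list. The function name `last_intro_frame` is hypothetical, and returning the index of the frame just before the drop (rather than the frame after it) is my assumption about which frame counts as the "last frame".

```python
from typing import Optional, Sequence


def last_intro_frame(scores: Sequence[float], min_drop: float = 0.5) -> Optional[int]:
    """Return the index of the frame just before the first large drop in score."""
    for index in range(1, len(scores)):
        if scores[index - 1] - scores[index] > min_drop:
            return index - 1
    return None


# Made-up scores: similarity climbs toward the end of the intro, then collapses.
example_scores = [0.2, 0.3, 0.9, 0.95, 0.97, 0.15, 0.1]
print(last_intro_frame(example_scores))  # 4
```

Only drops count here: a rise of more than 0.5 (such as entering the final shot of the intro) gives a negative difference and is ignored.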

A Very Naive Multi Core Version of Last Frame Based Intro Detection

In this implementation I’m using the multiprocessing library to make use of multiple cores (4), and some data structures to share data. I’m operating in batches of 1000 frames.

Main Process:

Loop Forever
Read 1000 frames
Convert each to grayscale
Push the index of the frame and frame data into a queue
Launch the worker processes (one per core)
Loop over the calculated scores and check whether the previous score minus the current score is greater than 0.5

Consumer Processes:

Read an index and frame from queue as long as there are values to be read
Calculate MS-SSIM score and put it into shared array at the index that was popped off the queue
End once there are no more frames to be read
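For comparison, here is a sketch of a simpler way to score one batch across cores. It is not the queue-and-shared-array design described above: it swaps in `multiprocessing.Pool.map`, which returns results in input order, so no index has to travel with each frame. `placeholder_similarity` is a stand-in for MS-SSIM (it is just a mean absolute difference), frames are modeled as flat lists of grayscale values, and all names are hypothetical. I haven't measured whether this is faster for this workload.

```python
import multiprocessing as mp
from functools import partial
from typing import List, Sequence

Frame = Sequence[int]  # flattened grayscale pixels, 0-255 (illustrative only)


def placeholder_similarity(reference: Frame, frame: Frame) -> float:
    """Stand-in for MS-SSIM: 1.0 for identical frames, lower as pixels differ."""
    total_diff = sum(abs(a - b) for a, b in zip(reference, frame))
    return 1.0 - total_diff / (255 * len(reference))


def score_batch(reference: Frame, frames: List[Frame], workers: int = 4) -> List[float]:
    """Score one batch of frames; results come back in frame order."""
    score = partial(placeholder_similarity, reference)
    if workers <= 1:
        # Serial fallback, handy for checking results against the parallel path.
        return [score(frame) for frame in frames]
    # Each task carries several frames, so there are fewer round trips
    # through the pool's internal queues than with one frame per task.
    chunksize = max(1, len(frames) // (workers * 4))
    with mp.Pool(processes=workers) as pool:
        return pool.map(score, frames, chunksize=chunksize)
```

A call like `score_batch(reference, batch, workers=4)` should sit under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module.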

Performance

Each implementation was tested on S01E38 of Dragon Ball Z Kai, where the intro is offset a little into the video, to get a better idea of how the first and last frame implementations compare.

Specs of Machine

CPU: AMD Ryzen 7 5800X3D
RAM: 32 GB

Video Specs

Attribute	Value
Resolution	1440x1080
Encoding	h264 (High)
Colorspace	yuv420p (progressive)
FPS	23.98
Bitrate	4907332 bits/s (about 613.4 kB/s)

These are very naive, unoptimized implementations, and there is for sure performance left on the table.

First Frame Based Intro Detection

Metric	Value
Time Of Intro	00:26
Frame	624
Runtime	39s
FPS	16

This FPS seemed fairly low, and I realized it was because the reference frame I was using had a resolution of 1440x1080. Let’s see what the performance looks like when the reference image is 960x720, just as it is for the last frame based implementations.

Metric	Value
Time Of Intro	00:26
Frame	624
Runtime	19s
FPS	32.8

Shrinking the images that MS-SSIM operates on clearly improves performance.

Metric	Value	Increase/Decrease
Runtime	51%	Decrease
FPS	105%	Increase
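The figures in these tables can be reproduced from the raw numbers: FPS is frames processed divided by runtime, and the change is measured relative to the 1440x1080 run. The helper names below are hypothetical.

```python
def throughput(frames: int, runtime_seconds: float) -> float:
    """Frames processed per second of wall-clock time."""
    return frames / runtime_seconds


def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, as a percentage."""
    return (after - before) / before * 100


print(f"{throughput(624, 39):.1f} fps")            # 16.0 fps
print(f"{throughput(624, 19):.1f} fps")            # 32.8 fps
print(f"Runtime: {percent_change(39, 19):.0f}%")   # Runtime: -51%
print(f"FPS: {percent_change(16, 32.8):+.0f}%")    # FPS: +105%
```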

Last Frame Based Intro Detection

Metric	Value
Time Of Intro	1:51
Frame	2664
Runtime	1:22
FPS	32.5

Naive Multi Core Implementation

Metric	Value
Time Of Intro	1:51
Frame	2664
Runtime	51s
FPS	52.2
FPS/Core	13
Cores	4
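To put the multi core numbers in context, here is the speedup over the single-core last frame run and how much of each core that speedup represents. The function name is hypothetical; the inputs come from the two tables above.

```python
from typing import Tuple


def speedup_and_efficiency(single_fps: float, multi_fps: float, cores: int) -> Tuple[float, float]:
    """Return (speedup over single-core, fraction of ideal linear scaling)."""
    speedup = multi_fps / single_fps
    return speedup, speedup / cores


speedup, efficiency = speedup_and_efficiency(32.5, 52.2, 4)
print(f"Speedup: {speedup:.2f}x")        # Speedup: 1.61x
print(f"Efficiency: {efficiency:.0%}")   # Efficiency: 40%
print(f"FPS/Core: {52.2 / 4:.0f}")       # FPS/Core: 13
```

With perfect scaling, 4 cores would give a 4x speedup, so a 1.61x speedup means each core is doing roughly 40% of the work it does in the single-core run.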

Limitations

Accuracy

The first frame based method is not entirely accurate at the moment. Because the DBZ Kai intro opens with around a second of black frames, the audio didn’t sync up when I replaced the audio track. Realistically, this could probably be solved by offsetting the calculated frame by the number of black frames that precede it. This seems a bit simplistic, and I imagine there could be some problems with it, which is what led me to use the last frame based method.
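The offset idea could look something like the sketch below, which I have not tried on the actual episodes. It assumes a per-frame mean brightness (0-255) is already available, for example from calling `.mean()` on each grayscale frame, and it treats anything below a threshold of 10 as black. Both the threshold and the function name are assumptions.

```python
from typing import Sequence


def intro_start_with_black_lead_in(
    brightness: Sequence[float],
    first_visual_frame: int,
    black_threshold: float = 10.0,
) -> int:
    """Walk backwards from the first visual frame over contiguous black frames."""
    start = first_visual_frame
    while start > 0 and brightness[start - 1] < black_threshold:
        start -= 1
    return start


# Made-up brightness values: frames 2-5 are black, frame 6 is the first visual frame.
example_brightness = [120, 118, 3, 2, 1, 2, 90, 95]
print(intro_start_with_black_lead_in(example_brightness, 6))  # 2
```

One problem with this is that a dark scene right before the intro would get counted as part of the black lead-in, which pushes the start too early.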

Methodology

While testing on a few episodes I realized that my script was failing on S01E64. Upon further inspection I saw that the last frame of the intro was not the same as in S01E01 and S01E38. This means the method only works on a subset of episodes, so I should find a way of recognizing the end of an intro that relies on something other than frame similarity alone. Maybe something based on audio would work better.

Multi Core

My multi core implementation is very, very bad. I get a somewhat better time overall, but the per-core rate drops a lot (13 fps per core versus about 32.5 fps single-core). I’m not very familiar with multiprocessing, so this is mostly a skill issue. I imagine the multiprocessing queue has some sort of built-in synchronization or locking, which could be holding back performance. As a next step, I want to use a lock-free method for grabbing frames and storing scores.