Understanding Content-Based Filtering in Recommendation Systems

How Netflix Suggests Movies Based on Your Preferences

Darshana

Feb 02, 2025

Hey, this is Darshana!

Welcome back to my AI series, where I break down complex concepts into simple, relatable examples.

So far, we have explored common machine learning applications and taken a closer look at sentiment analysis.

We then moved on to understanding recommendation systems, starting with collaborative filtering.

In this article, we will explore the role of content-based filtering in recommendation systems.

If you’ve missed my previous posts, feel free to start here!

AI Simplified: Understanding Its Power and Potential

Darshana Jaisingh

December 23, 2024


If you have any suggestions or topics you’d like me to simplify with examples, please let me know in the comments.

Before we start, let me give you a brief recap of what Collaborative Filtering is. It is used to predict a user’s interest by finding other users with similar preferences. So, if User A watched Stranger Things and User B also watched it, the AI system might suggest something to User B when User A watches something new, assuming their preferences are similar.

While collaborative filtering relies on the behavior of users with similar tastes, content-based filtering focuses on identifying the features of the content and matching them with your own preferences.

Let’s consider an example. You might be familiar with the Spotify app.

Imagine you frequently listen to soft acoustic songs in the evenings and high-energy workout music in the mornings. Spotify uses content-based filtering to analyze the features of the songs you play, such as:

Genre: Acoustic, Rock, Pop, EDM
Mood/Ambience: Relaxing, Upbeat, Melancholic, Energetic
Artist & Album: Ed Sheeran, Coldplay, Billie Eilish
Beats Per Minute (BPM): Slow (60–90 BPM) vs. Fast (120+ BPM)
Instrumentation: Acoustic Guitar, Synth, Heavy Bass

Now, let’s say you mostly listen to Ed Sheeran, John Mayer, and Coldplay at night. If a new song with similar acoustic tones and mellow beats is released by an artist you haven’t heard before — like James Bay — Spotify might recommend it in your evening playlist because it matches the features of the songs you already like.

Similarly, if you often listen to high-energy EDM during workouts, Spotify may filter out slower ballads when suggesting songs for your workout playlist. This is content-based filtering — recommendations are made based on song features.
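Here is a minimal sketch of that idea in Python. It assumes a tiny, made-up catalog where each song is described only by a genre and a BPM value. The song titles, the `SONGS` list, the `workout_candidates` function, and the 120 BPM cut-off are all hypothetical; this is an illustration, not how Spotify actually does it.

```python
# Hypothetical catalog: each song is described only by its features.
SONGS = [
    {"title": "Song A", "genre": "EDM", "bpm": 128},
    {"title": "Song B", "genre": "Acoustic", "bpm": 72},
    {"title": "Song C", "genre": "Rock", "bpm": 140},
    {"title": "Song D", "genre": "Pop", "bpm": 85},
]


def workout_candidates(songs, min_bpm=120):
    """Keep only the songs that are fast enough for a workout playlist."""
    return [song["title"] for song in songs if song["bpm"] >= min_bpm]


print(workout_candidates(SONGS))  # ['Song A', 'Song C']
```

Notice that the filter looks only at the songs' own features. It never asks what other listeners played.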

In our previous use case with collaborative filtering, we explored how Netflix uses it in its recommendation model. To keep the flow simple and easy to understand, we’ll use the same example to explain content-based filtering.

Content-based filtering in Netflix

Netflix suggests movies and shows by analyzing their properties and matching them with your preferences. It assumes that if you liked one item, you would enjoy similar items.

Properties can be anything that describes the content. In terms of movies and shows, some of the properties might include:

Genre (Action, Comedy, Drama, etc.)
Cast (Leonardo DiCaprio, Robert Downey Jr., etc.)
Director (Christopher Nolan, Steven Spielberg, etc.)
Language (English, Spanish, etc.)
Keywords (sci-fi, time travel, romance, etc.)

Photo by Samuel Regan-Asante on Unsplash

If you watched and liked “Inception,” Netflix might start suggesting you more Sci-Fi, Thriller, and Christopher Nolan movies.

For instance, Netflix might suggest “Interstellar” because they share common features: Sci-Fi, Christopher Nolan, and a thought-provoking story.

So here Netflix is making these recommendations based on movie features, not on what other users watched.

You might wonder how this is achieved. Computers don’t understand themes, languages, or genres the way we do. Instead, they process everything in 0s and 1s.

This is achieved by representing each movie/show as a vector (a list of numbers) based on different features.

Let’s say we have three movies:

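If a feature is present, we mark it as 1; otherwise, we mark it as 0. Here the genre slots are ordered as (Sci-Fi, Action, Drama) and the director slots as (Nolan, Tarantino, Cameron).

“Inception” is Sci-Fi (1, 0, 0) and directed by Nolan (1, 0, 0).

“Django” is Action (0, 1, 0) and directed by Tarantino (0, 1, 0).

“Titanic” is Drama (0, 0, 1) and directed by Cameron (0, 0, 1).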

Now every movie is a vector of numbers — this is how the AI “sees” movies.
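To make this concrete, here is a small Python sketch of the encoding for our three movies. The slot order in `GENRES` and `DIRECTORS` and the helper names `one_hot` and `encode_movie` are my own assumptions for illustration; a real system would track far more features.

```python
# Assumed slot order for each feature group.
GENRES = ["Sci-Fi", "Action", "Drama"]
DIRECTORS = ["Nolan", "Tarantino", "Cameron"]


def one_hot(value, vocabulary):
    """Return a list with 1 at the position of `value` and 0 everywhere else."""
    return [1 if item == value else 0 for item in vocabulary]


def encode_movie(genre, director):
    """Join the genre slots and the director slots into one feature vector."""
    return one_hot(genre, GENRES) + one_hot(director, DIRECTORS)


movies = {
    "Inception": encode_movie("Sci-Fi", "Nolan"),
    "Django": encode_movie("Action", "Tarantino"),
    "Titanic": encode_movie("Drama", "Cameron"),
}

for title, vector in movies.items():
    print(title, vector)
# Inception [1, 0, 0, 1, 0, 0]
# Django [0, 1, 0, 0, 1, 0]
# Titanic [0, 0, 1, 0, 0, 1]
```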

Now, let’s say you, as a user, watched and liked “Inception” and “Interstellar”.

Your preference is computed as the average of the features of those movies:

“Inception” → Sci-Fi (1, 0, 0), Nolan (1, 0, 0), DiCaprio (1, 0, 0)
“Interstellar” → Sci-Fi (1, 0, 0), Nolan (1, 0, 0), a different lead actor (0, 0, 1)

If we average these vectors, we get your preference (user profile) as:
User Profile: (1, 0, 0) for Sci-Fi, (1, 0, 0) for Nolan, (0.5, 0, 0.5) for actors.

What Does This Mean?

You have a strong preference for Sci-Fi and Christopher Nolan movies.
DiCaprio gets only a moderate weight (0.5), the same as the lead actor of “Interstellar”, so you are open to other actors.
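The averaging step can be sketched in a few lines of Python. I am assuming nine slots per movie (three for genre, three for director, three for actor), with DiCaprio in the first actor slot and the lead actor of “Interstellar” in the third. The function name `build_profile` is hypothetical.

```python
# Assumed slot order: 3 genre slots, 3 director slots, 3 actor slots.
inception = [1, 0, 0, 1, 0, 0, 1, 0, 0]
interstellar = [1, 0, 0, 1, 0, 0, 0, 0, 1]


def build_profile(liked_vectors):
    """Average the feature vectors of the liked movies, slot by slot."""
    count = len(liked_vectors)
    return [sum(values) / count for values in zip(*liked_vectors)]


profile = build_profile([inception, interstellar])
print(profile)
# [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.5, 0.0, 0.5]
```

The 1.0 values are features shared by every movie you liked, while the 0.5 values are features that appeared in only half of them.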

Finding the next best Recommendation

Now, to find the next best recommendation for you, a system like Netflix’s can use techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) and Cosine Similarity. Let me briefly explain them.

TF-IDF (Term Frequency-Inverse Document Frequency)

TF stands for Term Frequency. It measures how important a word is within a single document based on how often it appears there. For example, in an article about cats, the word “cat” might appear more than 10 times, while the word “dog” might only appear once, indicating that “cat” is more important.

However, this can result in common filler words like “a,” “the,” or “an” being treated as important, even though they don’t add much value. To address this, we use IDF.

IDF stands for Inverse Document Frequency. It evaluates how rare a word is across documents. For example, if you read 100 restaurant reviews, words like “food” and “tasty” appear everywhere, so they don’t carry much weight. But if one review mentions “spicy butter chicken,” that stands out.

TF-IDF combines both TF and IDF to find those unique words that truly define a document!

To sum up, TF-IDF helps identify the most important words by filtering out common ones and highlighting the unique ones. It’s useful in search engines, text analysis, and recommendation systems to better understand what the content is really about.
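If you would like to see the numbers, here is a minimal Python sketch. It uses one common textbook form of the formula, tf = count / document length and idf = log(N / number of documents containing the word); libraries often use slightly different variants. The three reviews and the function name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of restaurant reviews (already lowercased).
reviews = [
    "the food was tasty",
    "tasty food and the service was quick",
    "the spicy butter chicken was tasty food",
]
documents = [review.split() for review in reviews]


def tf_idf(term, document, corpus):
    """TF-IDF with tf = count / length and idf = log(N / document frequency)."""
    tf = Counter(document)[term] / len(document)
    doc_freq = sum(1 for doc in corpus if term in doc)
    if doc_freq == 0:
        return 0.0  # the word never appears in the corpus
    return tf * math.log(len(corpus) / doc_freq)


third = documents[2]
for word in ["food", "tasty", "spicy"]:
    print(word, round(tf_idf(word, third, documents), 3))
# food 0.0
# tasty 0.0
# spicy 0.157
```

“food” and “tasty” appear in every review, so their IDF is log(1) = 0 and they drop out, while “spicy” appears in only one review and gets the highest score.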

Cosine Similarity

Imagine you have two movie reviews:

“This movie is exciting, thrilling, and full of action.”
“An action-packed and thrilling movie with lots of excitement.”

Even though the words are arranged differently, both reviews are talking about the same thing. Cosine Similarity measures how similar two documents are based on the words they share, regardless of word order.

It treats each document as a vector of numbers (word frequencies) and calculates the angle between the two vectors: the smaller the angle, the more similar the documents.
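Here is a rough Python sketch that compares the two reviews above using plain word counts. The helper names `word_counts` and `cosine_similarity` and the third “unrelated” review are my own additions. Note that this simple version treats “exciting” and “excitement” as different words; a real system would likely normalize words first, for example with stemming.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count how often each word appears."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(counts_a, counts_b):
    """Dot product of the count vectors divided by the product of their lengths."""
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


review_1 = word_counts("This movie is exciting, thrilling, and full of action.")
review_2 = word_counts("An action-packed and thrilling movie with lots of excitement.")
unrelated = word_counts("The pasta was cold.")

print(round(cosine_similarity(review_1, review_2), 2))  # 0.53
print(cosine_similarity(review_1, unrelated))           # 0.0
```

A score of 1 means the two vectors point in exactly the same direction, and 0 means the documents share no words at all.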

Now that we’ve understood the above techniques, let’s continue with our search for the “next movie recommendation”.

The AI (recommendation system) compares your profile/preferences to all available movies and picks the most similar ones using cosine similarity.

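Your profile shows a strong preference for Sci-Fi (1,0,0) and Nolan (1,0,0).

Movie: Tenet (Sci-Fi, Nolan, John David Washington) → (1,0,0), (1,0,0), (0,1,0)
Cosine Similarity score between your profile and “Tenet” → HIGH ✅ (genre and director both match)
Movie: Django (Action, Tarantino, Jamie Foxx) → (0,1,0), (0,1,0), (0,1,0)
Cosine Similarity → LOW ❌ (no feature overlaps with your profile)

The actor slots are simplified here: a real system would have one slot per actor, so Washington and Foxx would not share a position.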

Netflix will rank movies by similarity and recommend “Tenet” first.
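The ranking step can be sketched like this in Python, using the same vectors as in the example (genre, director, and actor slots joined into one list of nine numbers). The function and variable names are hypothetical, and a real recommender would score thousands of titles rather than two.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Slots: 3 genres, 3 directors, 3 actors (simplified, as in the example).
profile = [1, 0, 0, 1, 0, 0, 0.5, 0, 0.5]
candidates = {
    "Django": [0, 1, 0, 0, 1, 0, 0, 1, 0],
    "Tenet": [1, 0, 0, 1, 0, 0, 0, 1, 0],
}

scores = {title: cosine_similarity(profile, vec) for title, vec in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for title in ranked:
    print(title, round(scores[title], 2))
# Tenet 0.73
# Django 0.0
```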

Genres and actors are a good start, but movie descriptions matter too. So how do we improve the recommendation model?

A system like Netflix’s can use TF-IDF (Term Frequency-Inverse Document Frequency) to rank how important words are in a movie’s metadata.

For example,

Movie 1: “A mind-bending sci-fi thriller about dreams and reality.”
Movie 2: “A romantic drama set on the Titanic.”

TF-IDF assigns higher weight to unique words (“mind-bending”, “sci-fi”) and lower weight to common words (“a”, “about”).

If you watched Sci-Fi movies with “mind-bending” themes, Netflix will look for other movies where “mind-bending” is an important keyword.
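Here is a small Python sketch that puts TF-IDF and cosine similarity together for descriptions. Movie 1 and Movie 2 are the descriptions from above; Movie 3 and Movie 4 are hypothetical descriptions I added so that the corpus is large enough for common words to stand out. The formula is the same textbook variant (tf = count / length, idf = log(N / document frequency)), and all function names are my own.

```python
import math
import re
from collections import Counter

descriptions = {
    "Movie 1": "A mind-bending sci-fi thriller about dreams and reality.",
    "Movie 2": "A romantic drama set on the Titanic.",
    "Movie 3": "A mind-bending thriller about a heist inside a dream.",  # hypothetical
    "Movie 4": "A comedy about a family road trip.",  # hypothetical
}


def tokenize(text):
    """Lowercase and split into words, keeping hyphenated words whole."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


tokens = {title: tokenize(text) for title, text in descriptions.items()}


def idf(term):
    """log(N / number of descriptions that contain the term)."""
    doc_freq = sum(1 for words in tokens.values() if term in words)
    return math.log(len(tokens) / doc_freq)


def tfidf_vector(words):
    """Map each word of one description to its TF-IDF weight."""
    counts = Counter(words)
    return {term: (count / len(words)) * idf(term) for term, count in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = {title: tfidf_vector(words) for title, words in tokens.items()}

liked = "Movie 1"
ranking = sorted(
    (title for title in vectors if title != liked),
    key=lambda title: cosine(vectors[liked], vectors[title]),
    reverse=True,
)
print(ranking)  # ['Movie 3', 'Movie 4', 'Movie 2']
```

In this toy corpus, “a” appears in every description and gets a weight of 0, “about” gets a small weight, and “sci-fi” and “mind-bending” get larger ones. That is why Movie 3, which shares “mind-bending” and “thriller” with Movie 1, comes out on top.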

So basically, this is how Netflix recommends movies and shows based on content filtering. However, like everything, it does come with its own issues.

Problem 1: Cold Start (New Users)

If you’re a new user, there’s no watch history to build a profile from. A common workaround is to show trending or popular content first and gradually adapt to your choices as you watch more.
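A minimal sketch of that fallback in Python could look like this. The `recommend` function, the placeholder titles, and the `top_n` parameter are all hypothetical, just to show the branching logic.

```python
def recommend(watch_history, personalized, popular, top_n=3):
    """Return personalized picks, or popular titles for a brand-new user."""
    if not watch_history:
        return popular[:top_n]
    return personalized[:top_n]


# Placeholder lists standing in for real ranking results.
popular_titles = ["Popular Show 1", "Popular Show 2", "Popular Show 3", "Popular Show 4"]
personalized_titles = ["Tenet", "Interstellar"]

print(recommend([], personalized_titles, popular_titles))
# ['Popular Show 1', 'Popular Show 2', 'Popular Show 3']
print(recommend(["Inception"], personalized_titles, popular_titles))
# ['Tenet', 'Interstellar']
```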

Problem 2: Content Filtering is Too Narrow

If you only watch Sci-Fi, you might never see Comedy. The usual solution is a hybrid model that mixes collaborative filtering (what others like) with content-based filtering.
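One simple way to build a hybrid is a weighted blend of the two scores. The sketch below assumes each title already has a content-based score and a collaborative score between 0 and 1. The titles, the scores, and the 50/50 `content_weight` are invented for illustration, and real systems combine the signals in more sophisticated ways.

```python
def hybrid_score(content_score, collaborative_score, content_weight=0.5):
    """Blend the two scores; content_weight=1.0 means content-based only."""
    return content_weight * content_score + (1 - content_weight) * collaborative_score


# Made-up scores for a user who has only watched Sci-Fi so far.
scores = {
    "Sci-Fi Movie": {"content": 0.9, "collaborative": 0.4},
    "Comedy Movie": {"content": 0.1, "collaborative": 0.9},
    "Drama Movie": {"content": 0.2, "collaborative": 0.3},
}


def rank(content_weight):
    """Sort titles from best to worst for a given blend."""
    return sorted(
        scores,
        key=lambda t: hybrid_score(scores[t]["content"], scores[t]["collaborative"], content_weight),
        reverse=True,
    )


print(rank(1.0))  # ['Sci-Fi Movie', 'Drama Movie', 'Comedy Movie']
print(rank(0.5))  # ['Sci-Fi Movie', 'Comedy Movie', 'Drama Movie']
```

With content-based scores alone, the comedy lands at the bottom. Once the collaborative score is blended in, it moves up to second place because similar users enjoyed it.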

So, when you combine both Collaborative and Content-based filtering, that’s the gist of how recommendation systems work.

If you found this article valuable and want to join in this journey with me, subscribe to my Substack channel.
Feel free to share this article with anyone you think would benefit from it. Your support means the world to me, and I look forward to bringing you more insightful content.
