Complete documentation of the Information Retrieval System implementation for Netflix Movies and TV Shows
Netflix Search Engine is a web-based Information Retrieval application that lets users search for movies and TV shows in the Netflix dataset. The system implements three main Information Retrieval components: an Inverted Index, the Vector Space Model (TF-IDF weighting with Cosine Similarity), and Query Expansion.
The system uses the Netflix Movies and TV Shows dataset, which contains 7,787 items with complete information such as title, description, genre, release year, duration, and rating.
The system is built with a web-based architecture using the Flask framework:
Text cleaning and normalization before indexing:
def _preprocess_text(self, text: str) -> List[str]:
"""
Preprocess text: lowercase, remove punctuation, tokenize
"""
if not text or text == 'nan':
return []
# Convert to lowercase
text = text.lower()
# Remove punctuation
text = text.translate(str.maketrans('', '', string.punctuation))
# Tokenize
tokens = text.split()
# Remove stopwords
stopwords = {
'a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from',
'has', 'he', 'in', 'is', 'it', 'its', 'of', 'on', 'that', 'the',
'to', 'was', 'will', 'with'
}
tokens = [token for token in tokens if token not in stopwords and len(token) > 1]
return tokens
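For example, applying this preprocessing to a short string (the string itself is just an illustration) drops punctuation, stopwords, and single-character tokens:

```python
engine._preprocess_text("The Action Movie!")
# -> ['action', 'movie']   ("The" is a stopword, "!" is stripped)
```

Here `engine` stands for an instance of the NetflixSearchEngine class described below.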
An Inverted Index is a data structure that maps each term (word) to the list of documents that contain it. It is the foundation of modern search systems because it enables very efficient lookups.
Example (an illustrative index over a tiny, assumed collection is sketched below; the routine that builds the real index follows after it):
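A minimal sketch, assuming three toy documents rather than the actual Netflix data:

```python
# Assumed toy collection (not taken from the dataset):
#   doc 0: "Action Movie"    doc 1: "Action Thriller"    doc 2: "Romantic Comedy"
inverted_index = {
    "action":   {0, 1},
    "movie":    {0},
    "thriller": {1},
    "romantic": {2},
    "comedy":   {2},
}

# Looking up a query term immediately yields its candidate documents:
inverted_index["action"]  # -> {0, 1}
```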
def _build_inverted_index(self):
"""
Build inverted index: term -> set of document IDs
"""
print("Building inverted index...")
for idx, doc in enumerate(self.documents):
# Combine title, description, and genres for indexing
text = f"{doc['title']} {doc['description']} {doc['listed_in']}"
tokens = self._preprocess_text(text)
# Add terms to inverted index
for term in set(tokens): # Use set to avoid duplicates
self.inverted_index[term].add(idx)
print(f"Inverted index built with {len(self.inverted_index)} terms")
In the Vector Space Model, every document and query is represented as a vector in a multi-dimensional space, where each dimension corresponds to a term.
TF-IDF is a weighting scheme that measures how important a term is in a document relative to the whole document collection: TF-IDF(t, d) = TF(t, d) × IDF(t).
Interpretation: how often the term appears in the document (normalized by document length).
def _calculate_tf(self, tokens: List[str]) -> Dict[str, float]:
"""
Calculate Term Frequency (TF)
TF = (number of times term appears) / (total terms)
"""
tf = {}
total_terms = len(tokens)
if total_terms == 0:
return tf
term_counts = Counter(tokens)
for term, count in term_counts.items():
tf[term] = count / total_terms
return tf
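For instance, with a made-up token list the counts are simply normalized by the total number of tokens:

```python
from collections import Counter

tokens = ["action", "action", "movie"]             # made-up example tokens
counts = Counter(tokens)                           # {'action': 2, 'movie': 1}
tf = {term: c / len(tokens) for term, c in counts.items()}
print(tf)                                          # {'action': 0.666..., 'movie': 0.333...}
```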
Interpretation: how unique, and therefore how informative, the term is within the document collection.
def _calculate_idf(self):
"""
Calculate Inverse Document Frequency (IDF)
IDF = log(total documents / documents containing term)
"""
total_docs = len(self.documents)
for term, doc_ids in self.inverted_index.items():
self.idf[term] = math.log(total_docs / len(doc_ids))
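As a rough numeric illustration with the 7,787-item collection (the document frequencies below are assumed, not measured from the dataset), a rare term scores far higher than a common one; note that math.log is the natural logarithm:

```python
import math

total_docs = 7787
print(math.log(total_docs / 100))    # term appearing in 100 documents   -> ~4.36
print(math.log(total_docs / 5000))   # term appearing in 5,000 documents -> ~0.44
```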
Interpretation: a term that appears often in a particular document but rarely in the rest of the collection receives a high score.
def _calculate_tf_idf(self):
"""
Calculate TF-IDF for all documents
TF-IDF = TF * IDF
"""
# Calculate IDF
self._calculate_idf()
# Calculate TF-IDF for each document
for idx, doc in enumerate(self.documents):
text = f"{doc['title']} {doc['description']} {doc['listed_in']}"
tokens = self._preprocess_text(text)
# Calculate TF
tf = self._calculate_tf(tokens)
# Calculate TF-IDF
tf_idf_vector = {}
for term, tf_value in tf.items():
if term in self.idf:
tf_idf_vector[term] = tf_value * self.idf[term]
self.document_vectors[idx] = tf_idf_vector
As an illustration, suppose there are 3 documents. Compare the term "action" in Doc 1 with the term "romantic" in Doc 3: "romantic" ends up with the higher TF-IDF score because it is more distinctive, appearing in only 1 of the 3 documents, while "action" appears in more than one. A worked sketch with assumed numbers follows below.
Cosine Similarity measures the similarity of two vectors by computing the cosine of the angle between them. For the non-negative TF-IDF vectors used here, the value ranges from 0 (no similarity) to 1 (identical).
1. Dot Product: A · B = Σ (A_i × B_i)
2. Magnitude: ||A|| = √(Σ A_i²), computed the same way for ||B||
3. Cosine Similarity: cos(A, B) = (A · B) / (||A|| × ||B||)
def _cosine_similarity(self, vec1: Dict[str, float], vec2: Dict[str, float]) -> float:
"""
Calculate cosine similarity between two vectors
Cosine Similarity = (A · B) / (||A|| * ||B||)
"""
# Calculate dot product
dot_product = sum(
vec1.get(term, 0) * vec2.get(term, 0)
for term in set(vec1.keys()) | set(vec2.keys())
)
# Calculate magnitudes
magnitude1 = math.sqrt(sum(val ** 2 for val in vec1.values()))
magnitude2 = math.sqrt(sum(val ** 2 for val in vec2.values()))
if magnitude1 == 0 or magnitude2 == 0:
return 0.0
return dot_product / (magnitude1 * magnitude2)
Query Vector: {action: 0.5, thriller: 0.3}
Document Vector: {action: 0.4, thriller: 0.2, movie: 0.1}
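Plugging these two example vectors into the formula (the numbers are the illustrative values above, not dataset values):

```python
import math

query = {"action": 0.5, "thriller": 0.3}
doc = {"action": 0.4, "thriller": 0.2, "movie": 0.1}

# Dot product over the union of terms; terms missing from a vector contribute 0.
dot = sum(query.get(t, 0) * doc.get(t, 0) for t in set(query) | set(doc))   # 0.26

mag_q = math.sqrt(sum(v ** 2 for v in query.values()))   # sqrt(0.34) ~= 0.583
mag_d = math.sqrt(sum(v ** 2 for v in doc.values()))     # sqrt(0.21) ~= 0.458

print(dot / (mag_q * mag_d))                              # ~0.97 -> highly similar
```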
Query Expansion is a technique for improving search results by adding relevant terms to the user's original query. It helps address the vocabulary mismatch between user queries and documents.
def _expand_query(self, query: str) -> List[str]:
"""
Query Expansion: Add related terms based on similar documents
"""
query_tokens = self._preprocess_text(query)
if not query_tokens:
return query_tokens
# Find documents that contain any query term
candidate_docs = set()
for term in query_tokens:
if term in self.inverted_index:
candidate_docs.update(self.inverted_index[term])
if not candidate_docs:
return query_tokens
# Find most frequent terms in top candidate documents
term_freq = Counter()
for doc_id in list(candidate_docs)[:20]: # Use top 20 candidates
doc = self.documents[doc_id]
text = f"{doc['title']} {doc['description']} {doc['listed_in']}"
tokens = self._preprocess_text(text)
term_freq.update(tokens)
# Add top 3 most frequent terms not in original query
expanded_terms = []
for term, _ in term_freq.most_common(10):
if term not in query_tokens and len(term) > 2:
expanded_terms.append(term)
if len(expanded_terms) >= 3:
break
# Combine original query with expanded terms
return query_tokens + expanded_terms
User Query
↓
[Preprocessing]
↓
[Query Expansion]
↓
[TF-IDF Vectorization]
↓
[Inverted Index Lookup] → Candidate Documents
↓
[Cosine Similarity Calculation]
↓
[Ranking by Score]
↓
Top-K Results → Display to User
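A minimal sketch of how the search() method could tie these stages together, reusing the components shown above (this illustrates the flow only; it is not the verbatim implementation, and the shape of the returned dictionary is an assumption):

```python
def search(self, query: str, top_k: int = 10) -> Dict:
    # 1-2. Preprocess and expand the query
    tokens = self._expand_query(query)

    # 3. Build a TF-IDF vector for the query, using the same weighting as documents
    tf = self._calculate_tf(tokens)
    query_vector = {t: w * self.idf[t] for t, w in tf.items() if t in self.idf}

    # 4. Inverted index lookup -> candidate documents
    candidates = set()
    for term in tokens:
        candidates.update(self.inverted_index.get(term, set()))

    # 5-6. Score candidates with cosine similarity and rank by score
    scored = [
        (doc_id, self._cosine_similarity(query_vector, self.document_vectors[doc_id]))
        for doc_id in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)

    # 7. Return the top-k results (response shape assumed for illustration)
    return {
        "query": query,
        "results": [
            {**self.documents[doc_id], "score": score}
            for doc_id, score in scored[:top_k]
        ],
    }
```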
The core logic of the search engine: the NetflixSearchEngine class with the following methods:
class NetflixSearchEngine:
def __init__(self, csv_path: str)
# Initialize and load data
def _load_data(self, csv_path: str)
# Load CSV file
def _preprocess_text(self, text: str) -> List[str]
# Text preprocessing
def _build_inverted_index(self)
# Build inverted index structure
def _calculate_tf(self, tokens: List[str]) -> Dict[str, float]
# Calculate Term Frequency
def _calculate_idf(self)
# Calculate Inverse Document Frequency
def _calculate_tf_idf(self)
# Calculate TF-IDF for all documents
def _cosine_similarity(self, vec1, vec2) -> float
# Calculate cosine similarity
def _expand_query(self, query: str) -> List[str]
# Query expansion logic
def search(self, query: str, top_k: int = 10) -> Dict
# Main search function
def get_statistics(self) -> Dict
# Get system statistics
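Typical usage (the CSV filename is an assumption, and the fields of the returned dictionary follow the sketch above):

```python
engine = NetflixSearchEngine("netflix_titles.csv")   # assumed path to the dataset CSV
results = engine.search("romantic comedy", top_k=5)
for item in results["results"]:
    print(item["title"], item["score"])
```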
Flask application for the web interface:
@app.route('/')
def index():
# Render the main page
@app.route('/search', methods=['POST'])
def search():
# Handle search request
# Return JSON results
@app.route('/stats')
def stats():
# Return system statistics
@app.route('/documentation')
def documentation():
# Render the documentation page
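A minimal sketch of how the /search route could delegate to the engine (the request and response field names are assumptions, not taken from the actual app.py):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)
engine = NetflixSearchEngine("netflix_titles.csv")   # assumed dataset path

@app.route('/search', methods=['POST'])
def search():
    payload = request.get_json(force=True) or {}
    query = payload.get('query', '')          # request field names are assumptions
    top_k = int(payload.get('top_k', 10))
    return jsonify(engine.search(query, top_k=top_k))
```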
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Build Inverted Index | O(N Ć M) | O(T Ć D) |
| Calculate TF-IDF | O(N Ć M) | O(N Ć M) |
| Search Query | O(K Ć M) | O(M) |
| Cosine Similarity | O(M) | O(1) |
Where: N = number of documents, M = average number of terms per document, T = total number of unique terms, D = number of documents containing a term, K = number of candidate documents
The team that worked on this project:
Andika Tri Juni S
Dwiki Likuisa
Dandi Agus T
Les Endahti
Dwi Angesti Dinda P
Technologies used in this project:
Python 3.8+
Flask
HTML5/CSS3
Bootstrap 5
JavaScript
CSV
MySQL (optional)
math, csv, collections (Python standard library)
Netflix Search Engine - Information Retrieval System
Implementation of Inverted Index, Vector Space Model, Cosine Similarity & Query Expansion