Complete documentation of the Information Retrieval System implementation for Netflix Movies and TV Shows
Netflix Search Engine is a web-based Information Retrieval application that lets users search for movies and TV shows in the Netflix dataset. The system implements three main Information Retrieval components: an Inverted Index, the Vector Space Model (TF-IDF weighting with Cosine Similarity), and Query Expansion.
The system uses the Netflix Movies and TV Shows dataset, which contains 7,787 items with complete information such as title, description, genre, release year, duration, and rating.
The system is built with a web-based architecture using the Flask framework:
Text cleaning and normalization before indexing:
def _preprocess_text(self, text: str) -> List[str]:
"""
Preprocess text: lowercase, remove punctuation, tokenize
"""
if not text or text == 'nan':
return []
# Convert to lowercase
text = text.lower()
# Remove punctuation
text = text.translate(str.maketrans('', '', string.punctuation))
# Tokenize
tokens = text.split()
# Remove stopwords
stopwords = {
'a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from',
'has', 'he', 'in', 'is', 'it', 'its', 'of', 'on', 'that', 'the',
'to', 'was', 'will', 'with'
}
tokens = [token for token in tokens if token not in stopwords and len(token) > 1]
return tokens
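For example, applying this preprocessing to a short string (the string itself is just an illustration) drops punctuation, stopwords, and single-character tokens:

```python
engine._preprocess_text("The Action Movie!")
# -> ['action', 'movie']   ("The" is a stopword, "!" is stripped)
```

Here `engine` stands for an instance of the NetflixSearchEngine class described below.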
An Inverted Index is a data structure that maps each term (word) to the list of documents that contain it. It is the foundation of modern search systems because it enables very efficient lookups.
Example (an illustrative index over a tiny, assumed collection is sketched below; the routine that builds the real index follows after it):
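A minimal sketch, assuming three toy documents rather than the actual Netflix data:

```python
# Assumed toy collection (not taken from the dataset):
#   doc 0: "Action Movie"    doc 1: "Action Thriller"    doc 2: "Romantic Comedy"
inverted_index = {
    "action":   {0, 1},
    "movie":    {0},
    "thriller": {1},
    "romantic": {2},
    "comedy":   {2},
}

# Looking up a query term immediately yields its candidate documents:
inverted_index["action"]  # -> {0, 1}
```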
def _build_inverted_index(self):
"""
Build inverted index: term -> set of document IDs
"""
print("Building inverted index...")
for idx, doc in enumerate(self.documents):
# Combine title, description, and genres for indexing
text = f"{doc['title']} {doc['description']} {doc['listed_in']}"
tokens = self._preprocess_text(text)
# Add terms to inverted index
for term in set(tokens): # Use set to avoid duplicates
self.inverted_index[term].add(idx)
print(f"Inverted index built with {len(self.inverted_index)} terms")
In the Vector Space Model, every document and query is represented as a vector in a multi-dimensional space, where each dimension corresponds to a term.
TF-IDF is a weighting scheme that measures how important a term is in a document relative to the whole document collection: TF-IDF(t, d) = TF(t, d) × IDF(t).
Interpretation: how often the term appears in the document (normalized by document length).
def _calculate_tf(self, tokens: List[str]) -> Dict[str, float]:
"""
Calculate Term Frequency (TF)
TF = (number of times term appears) / (total terms)
"""
tf = {}
total_terms = len(tokens)
if total_terms == 0:
return tf
term_counts = Counter(tokens)
for term, count in term_counts.items():
tf[term] = count / total_terms
return tf
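For instance, with a made-up token list the counts are simply normalized by the total number of tokens:

```python
from collections import Counter

tokens = ["action", "action", "movie"]             # made-up example tokens
counts = Counter(tokens)                           # {'action': 2, 'movie': 1}
tf = {term: c / len(tokens) for term, c in counts.items()}
print(tf)                                          # {'action': 0.666..., 'movie': 0.333...}
```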
Interpretation: how unique, and therefore how informative, the term is within the document collection.
def _calculate_idf(self):
"""
Calculate Inverse Document Frequency (IDF)
IDF = log(total documents / documents containing term)
"""
total_docs = len(self.documents)
for term, doc_ids in self.inverted_index.items():
self.idf[term] = math.log(total_docs / len(doc_ids))
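As a rough numeric illustration with the 7,787-item collection (the document frequencies below are assumed, not measured from the dataset), a rare term scores far higher than a common one; note that math.log is the natural logarithm:

```python
import math

total_docs = 7787
print(math.log(total_docs / 100))    # term appearing in 100 documents   -> ~4.36
print(math.log(total_docs / 5000))   # term appearing in 5,000 documents -> ~0.44
```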
Interpretation: a term that appears often in a particular document but rarely in the rest of the collection receives a high score.
def _calculate_tf_idf(self):
"""
Calculate TF-IDF for all documents
TF-IDF = TF * IDF
"""
# Calculate IDF
self._calculate_idf()
# Calculate TF-IDF for each document
for idx, doc in enumerate(self.documents):
text = f"{doc['title']} {doc['description']} {doc['listed_in']}"
tokens = self._preprocess_text(text)
# Calculate TF
tf = self._calculate_tf(tokens)
# Calculate TF-IDF
tf_idf_vector = {}
for term, tf_value in tf.items():
if term in self.idf:
tf_idf_vector[term] = tf_value * self.idf[term]
self.document_vectors[idx] = tf_idf_vector
As an illustration, suppose there are 3 documents. Compare the term "action" in Doc 1 with the term "romantic" in Doc 3: "romantic" ends up with the higher TF-IDF score because it is more distinctive, appearing in only 1 of the 3 documents, while "action" appears in more than one. A worked sketch with assumed numbers follows below.
Cosine Similarity measures the similarity of two vectors by computing the cosine of the angle between them. For the non-negative TF-IDF vectors used here, the value ranges from 0 (no similarity) to 1 (identical).
1. Dot Product: A · B = Σ (A_i × B_i)
2. Magnitude: ||A|| = √(Σ A_i²), computed the same way for ||B||
3. Cosine Similarity: cos(A, B) = (A · B) / (||A|| × ||B||)
def _cosine_similarity(self, vec1: Dict[str, float], vec2: Dict[str, float]) -> float:
"""
Calculate cosine similarity between two vectors
Cosine Similarity = (A · B) / (||A|| * ||B||)
"""
# Calculate dot product
dot_product = sum(
vec1.get(term, 0) * vec2.get(term, 0)
for term in set(vec1.keys()) | set(vec2.keys())
)
# Calculate magnitudes
magnitude1 = math.sqrt(sum(val ** 2 for val in vec1.values()))
magnitude2 = math.sqrt(sum(val ** 2 for val in vec2.values()))
if magnitude1 == 0 or magnitude2 == 0:
return 0.0
return dot_product / (magnitude1 * magnitude2)
Query Vector: {action: 0.5, thriller: 0.3}
Document Vector: {action: 0.4, thriller: 0.2, movie: 0.1}
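Plugging these two example vectors into the formula (the numbers are the illustrative values above, not dataset values):

```python
import math

query = {"action": 0.5, "thriller": 0.3}
doc = {"action": 0.4, "thriller": 0.2, "movie": 0.1}

# Dot product over the union of terms; terms missing from a vector contribute 0.
dot = sum(query.get(t, 0) * doc.get(t, 0) for t in set(query) | set(doc))   # 0.26

mag_q = math.sqrt(sum(v ** 2 for v in query.values()))   # sqrt(0.34) ~= 0.583
mag_d = math.sqrt(sum(v ** 2 for v in doc.values()))     # sqrt(0.21) ~= 0.458

print(dot / (mag_q * mag_d))                              # ~0.97 -> highly similar
```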
Query Expansion is a technique for improving search results by adding relevant terms to the user's original query. It helps address the vocabulary mismatch between user queries and documents.
def _expand_query(self, query: str) -> List[str]:
"""
Query Expansion: Add related terms based on similar documents
"""
query_tokens = self._preprocess_text(query)
if not query_tokens:
return query_tokens
# Find documents that contain any query term
candidate_docs = set()
for term in query_tokens:
if term in self.inverted_index:
candidate_docs.update(self.inverted_index[term])
if not candidate_docs:
return query_tokens
# Find most frequent terms in top candidate documents
term_freq = Counter()
for doc_id in list(candidate_docs)[:20]: # Use top 20 candidates
doc = self.documents[doc_id]
text = f"{doc['title']} {doc['description']} {doc['listed_in']}"
tokens = self._preprocess_text(text)
term_freq.update(tokens)
# Add top 3 most frequent terms not in original query
expanded_terms = []
for term, _ in term_freq.most_common(10):
if term not in query_tokens and len(term) > 2:
expanded_terms.append(term)
if len(expanded_terms) >= 3:
break
# Combine original query with expanded terms
return query_tokens + expanded_terms
User Query
↓
[Preprocessing]
↓
[Query Expansion]
↓
[TF-IDF Vectorization]
↓
[Inverted Index Lookup] → Candidate Documents
↓
[Cosine Similarity Calculation]
↓
[Ranking by Score]
↓
Top-K Results → Display to User
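A minimal sketch of how the search() method could tie these stages together, reusing the components shown above (this illustrates the flow only; it is not the verbatim implementation, and the shape of the returned dictionary is an assumption):

```python
def search(self, query: str, top_k: int = 10) -> Dict:
    # 1-2. Preprocess and expand the query
    tokens = self._expand_query(query)

    # 3. Build a TF-IDF vector for the query, using the same weighting as documents
    tf = self._calculate_tf(tokens)
    query_vector = {t: w * self.idf[t] for t, w in tf.items() if t in self.idf}

    # 4. Inverted index lookup -> candidate documents
    candidates = set()
    for term in tokens:
        candidates.update(self.inverted_index.get(term, set()))

    # 5-6. Score candidates with cosine similarity and rank by score
    scored = [
        (doc_id, self._cosine_similarity(query_vector, self.document_vectors[doc_id]))
        for doc_id in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)

    # 7. Return the top-k results (response shape assumed for illustration)
    return {
        "query": query,
        "results": [
            {**self.documents[doc_id], "score": score}
            for doc_id, score in scored[:top_k]
        ],
    }
```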
The core logic of the search engine: the NetflixSearchEngine class with the following methods:
class NetflixSearchEngine:
def __init__(self, csv_path: str)
# Initialize and load data
def _load_data(self, csv_path: str)
# Load CSV file
def _preprocess_text(self, text: str) -> List[str]
# Text preprocessing
def _build_inverted_index(self)
# Build inverted index structure
def _calculate_tf(self, tokens: List[str]) -> Dict[str, float]
# Calculate Term Frequency
def _calculate_idf(self)
# Calculate Inverse Document Frequency
def _calculate_tf_idf(self)
# Calculate TF-IDF for all documents
def _cosine_similarity(self, vec1, vec2) -> float
# Calculate cosine similarity
def _expand_query(self, query: str) -> List[str]
# Query expansion logic
def search(self, query: str, top_k: int = 10) -> Dict
# Main search function
def get_statistics(self) -> Dict
# Get system statistics
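Typical usage (the CSV filename is an assumption, and the fields of the returned dictionary follow the sketch above):

```python
engine = NetflixSearchEngine("netflix_titles.csv")   # assumed path to the dataset CSV
results = engine.search("romantic comedy", top_k=5)
for item in results["results"]:
    print(item["title"], item["score"])
```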
Flask application for the web interface:
@app.route('/')
def index():
# Render the main page
@app.route('/search', methods=['POST'])
def search():
# Handle search request
# Return JSON results
@app.route('/stats')
def stats():
# Return system statistics
@app.route('/documentation')
def documentation():
# Render the documentation page
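A minimal sketch of how the /search route could delegate to the engine (the request and response field names are assumptions, not taken from the actual app.py):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)
engine = NetflixSearchEngine("netflix_titles.csv")   # assumed dataset path

@app.route('/search', methods=['POST'])
def search():
    payload = request.get_json(force=True) or {}
    query = payload.get('query', '')          # request field names are assumptions
    top_k = int(payload.get('top_k', 10))
    return jsonify(engine.search(query, top_k=top_k))
```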
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Build Inverted Index | O(N Ć M) | O(T Ć D) |
| Calculate TF-IDF | O(N Ć M) | O(N Ć M) |
| Search Query | O(K Ć M) | O(M) |
| Cosine Similarity | O(M) | O(1) |
Where: N = number of documents, M = average number of terms per document, T = total number of unique terms, D = number of documents containing a term, K = number of candidate documents
The team that worked on this project:
Andika Tri Juni S
Dwiki Likuisa
Dandi Agus T
Les Endahti
Dwi Angesti Dinda P
Technologies used in this project:
Python 3.8+
Flask
HTML5/CSS3
Bootstrap 5
JavaScript
CSV
MySQL (optional)
math, csv, collections (Python standard library)
Netflix Search Engine - Information Retrieval System
Implementation of Inverted Index, Vector Space Model, Cosine Similarity & Query Expansion