AI Engineering

LLM Interview Preparation: Complete Crash Course

Baljeet Dogra
• 45 min read

Preparing for an LLM engineering interview? This comprehensive crash course covers everything you need to know across 15 essential topics, from prompt engineering fundamentals to production deployment strategies. By the end of this guide, you'll be ready to confidently tackle technical interviews at top AI companies.

This article includes theory, practical examples, interview questions with answers, and a complete hands-on case study where you'll build an LLM Chat Assistant with dynamic context using RAG.

LLM Interview Preparation Roadmap

15-Topic Roadmap for LLM Interview Preparation

1. Prompt Engineering & LLM Basics

Core Concepts

What is an LLM?

Large Language Models (LLMs) are neural networks trained on massive text datasets to predict the next token in a sequence. They use the Transformer architecture and are trained in two main phases:

  • Pre-training: Unsupervised learning on vast text corpora to learn language patterns
  • Fine-tuning: Supervised learning on specific tasks or instruction-following

Tokens Explained

Tokens are the basic units of text processing in LLMs. A token is roughly:

  • ~4 characters in English
  • ~0.75 words on average
  • Can be a word, subword, or character depending on the tokenizer

Example: "Hello, world!" → ["Hello", ",", " world", "!"] = 4 tokens

Temperature Parameter

Temperature controls randomness in token selection:

  • Low (0-0.3): Deterministic, focused outputs (good for factual tasks, code)
  • Medium (0.5-0.7): Balanced creativity and coherence
  • High (0.8-1.0): Creative, varied outputs (good for brainstorming, creative writing)
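Under the hood, temperature divides the logits before the softmax, so low values sharpen the distribution toward the top token and high values flatten it. A minimal pure-Python sketch (the function name is illustrative, not from any library):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Scale logits by 1/temperature, softmax, then sample a token index."""
    if temperature <= 0:
        # Temperature 0 is conventionally treated as greedy decoding (argmax).
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point rounding
```

As temperature approaches 0 the sampler behaves like argmax; as it grows large, the distribution approaches uniform.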

Prompt Engineering Techniques

Zero-Shot Prompting

Asking the model to perform a task without examples.

Classify the sentiment: "I love this product!" → Positive

Few-Shot Prompting

Providing examples to guide the model's behavior.

Sentiment examples:
"Great!" → Positive
"Terrible" → Negative
"It's okay" → Neutral

Now classify: "Amazing product!" → Positive

Chain-of-Thought (CoT)

Asking the model to show its reasoning step-by-step.

Q: Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many does he have?
A: Let's think step by step:
1. Roger starts with 5 balls
2. He buys 2 cans × 3 balls = 6 balls
3. Total: 5 + 6 = 11 balls

Role Prompting

Assigning a specific role or persona to the model.

You are an expert Python developer. Review this code and suggest improvements...

Common Interview Questions

Q: What's the difference between Predictive AI and Generative AI?

A: Predictive AI (discriminative models) classifies or predicts from existing data (e.g., spam detection, image classification). Generative AI creates new content by learning data distributions (e.g., text generation, image synthesis). LLMs are generative models.

Q: How do you estimate LLM API costs?

A: Cost = (Input tokens Γ— Input price) + (Output tokens Γ— Output price)

Example: GPT-4 costs ~$0.03/1K input tokens and ~$0.06/1K output tokens. For 1,000 requests with 500 input + 200 output tokens each: Cost = (500K tokens / 1K × $0.03) + (200K tokens / 1K × $0.06) = $15 + $12 = $27
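That arithmetic wraps naturally in a small helper for quick what-if estimates (the prices below are the illustrative GPT-4 figures from the example, not current pricing):

```python
def estimate_cost(n_requests, input_tokens, output_tokens,
                  input_price_per_1k, output_price_per_1k):
    """Total API cost: per-request token counts, prices per 1K tokens."""
    total_input = n_requests * input_tokens
    total_output = n_requests * output_tokens
    return (total_input / 1000) * input_price_per_1k \
         + (total_output / 1000) * output_price_per_1k

# The worked example: 1,000 requests, 500 input + 200 output tokens each
print(estimate_cost(1000, 500, 200, 0.03, 0.06))  # 27.0
```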

Q: What are different decoding strategies?

A:

  • Greedy: Always pick the highest-probability token (deterministic but repetitive)
  • Beam Search: Keep top-k sequences, explore multiple paths
  • Top-k Sampling: Sample from the top k most likely tokens
  • Top-p (Nucleus): Sample from the smallest set of tokens with cumulative probability ≥ p
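Top-k and top-p can be sketched as filters over a probability distribution: each keeps a subset of tokens and renormalizes before sampling (function names are illustrative):

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set with cumulative mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}
```

Note the key difference: top-k keeps a fixed number of candidates, while top-p adapts the candidate set to how peaked the distribution is.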

Q: How to control hallucinations with prompt engineering?

A:

  • Use explicit instructions: "Only use information from the provided context"
  • Request citations: "Cite sources for each claim"
  • Add uncertainty handling: "If unsure, say 'I don't know'"
  • Use Chain-of-Thought to expose reasoning
  • Lower temperature for factual tasks

2. Retrieval Augmented Generation (RAG)

What is RAG?

RAG combines retrieval systems with LLMs to provide accurate, up-to-date, and verifiable responses. Instead of relying solely on the model's training data, RAG retrieves relevant information from external sources and includes it in the prompt.

RAG Pipeline

  1. User Query: The user asks a question
  2. Retrieval: The query is embedded and used to search a vector database for relevant documents
  3. Augmentation: Retrieved documents are added to the prompt as context
  4. Generation: The LLM generates a response based on the query + retrieved context

Benefits of RAG

  • Accuracy: Grounds responses in factual, retrieved information
  • Up-to-date: Can access current information beyond the training cutoff
  • Verifiable: Can cite sources for claims
  • Domain-specific: Works with proprietary/private data
  • Cost-effective: Cheaper than fine-tuning for knowledge updates

RAG vs Fine-Tuning

Aspect        RAG                               Fine-Tuning
Use Case      Knowledge injection, factual QA   Behavior/style adaptation, task-specific
Cost          Low (retrieval + inference)       High (training compute + data)
Updates       Easy (update knowledge base)      Hard (requires retraining)
Transparency  High (can cite sources)           Low (knowledge in weights)
Latency       Higher (retrieval overhead)       Lower (direct inference)

Interview Questions

Q: How does RAG work?

A: RAG retrieves relevant documents from a knowledge base using semantic search (vector similarity), then includes these documents as context in the LLM prompt. The LLM generates a response grounded in the retrieved information, reducing hallucinations and enabling access to current/private data.

Q: When should you use Fine-tuning instead of RAG?

A: Use fine-tuning when you need to:

  • Change the model's behavior, tone, or output format consistently
  • Teach domain-specific reasoning patterns
  • Minimize latency (no retrieval overhead)
  • Work with structured outputs or specific task performance

Use RAG when you need to inject knowledge, work with frequently updated information, or require source attribution.

Q: What are architecture patterns for customizing LLMs with proprietary data?

A:

  • RAG: Retrieve relevant docs and include them in the prompt
  • Fine-tuning: Train the model on proprietary data
  • Hybrid: Fine-tune for domain behavior + RAG for knowledge
  • Prompt Engineering: Include examples/instructions in the prompt
  • Function Calling: The LLM calls APIs to access data

3. Chunking Strategies

Why Chunking Matters

Chunking breaks large documents into smaller pieces for efficient retrieval and processing. Chunk size affects retrieval accuracy, context quality, and system performance.

Common Chunking Methods

  • Fixed-size: Split by character/token count (simple but may break semantic units)
  • Sentence-based: Split by sentences (preserves meaning)
  • Paragraph-based: Split by paragraphs (good for structured docs)
  • Semantic: Split by topic/meaning using embeddings
  • Recursive: Try multiple separators hierarchically

Key Interview Questions

Q: How to find ideal chunk size?

A: Experiment with different sizes (256, 512, 1024 tokens). Evaluate using retrieval metrics (precision, recall). Consider: embedding model context window, query complexity, document structure. Use overlap (10-20%) to preserve context across chunks.
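As a baseline, fixed-size chunking with overlap can be sketched in a few lines, here counting whitespace-separated words as a rough token proxy (a real system would count tokens with the embedding model's tokenizer):

```python
def chunk_text(text, chunk_size=512, overlap_ratio=0.15):
    """Fixed-size chunks with overlap, measured in whitespace-separated words."""
    tokens = text.split()
    overlap = int(chunk_size * overlap_ratio)
    step = max(1, chunk_size - overlap)  # guard against a zero/negative step
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end of the text
    return chunks
```

Each chunk repeats the tail of the previous one, so a sentence straddling a boundary still appears whole in at least one chunk.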

4. Embedding Models

What are Embeddings?

Embeddings are dense vector representations of text that capture semantic meaning. Similar texts have similar vectors (measured by cosine similarity).
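Cosine similarity itself is just a normalized dot product; a minimal implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

It ranges from -1 to 1; identical directions score 1, orthogonal vectors score 0.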

Popular Embedding Models

  • OpenAI text-embedding-3: High quality, 1536 dimensions
  • Sentence Transformers: Open source, customizable
  • Cohere Embed: Multilingual support
  • BGE/E5: State-of-the-art open models

Interview Questions

Q: How to improve embedding model accuracy?

A: 1) Fine-tune on domain data 2) Use hard negatives in training 3) Adjust similarity threshold 4) Try different models 5) Normalize embeddings 6) Use hybrid search (keyword + semantic)

5. Vector Databases

Vector Database Fundamentals

Vector databases store and retrieve high-dimensional vectors efficiently using specialized indexing algorithms (HNSW, IVF, PQ).

Popular Vector DBs

  • Pinecone: Managed, scalable, easy to use
  • Weaviate: Open source, GraphQL API
  • Qdrant: Rust-based, high performance
  • Chroma: Lightweight, developer-friendly
  • Milvus: Highly scalable, cloud-native

6. Advanced Search Algorithms

Search Techniques

  • Semantic Search: Vector similarity (cosine, dot product)
  • Keyword Search: BM25, TF-IDF
  • Hybrid Search: Combine semantic + keyword (RRF for merging)
  • Re-ranking: Cross-encoder models for better relevance
  • Query Expansion: Expand the query with synonyms/related terms
  • Multi-hop: Iterative retrieval for complex queries
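Reciprocal Rank Fusion (RRF), mentioned above for merging hybrid results, scores each document by the reciprocal of its rank in every list. A sketch with made-up document IDs (k=60 is the constant suggested in the original RRF paper):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs: each doc scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_results = ["d3", "d1", "d2"]  # e.g. from vector search
keyword_results = ["d1", "d4", "d3"]   # e.g. from BM25
fused = reciprocal_rank_fusion([semantic_results, keyword_results])
```

Here "d1" wins because it ranks well in both lists, even though it tops neither, which is exactly the behavior hybrid search wants.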

7. Language Model Internals

Transformer Architecture

Transformers use self-attention to process sequences in parallel, learning relationships between all tokens.

Key Components

  • Self-Attention: Computes attention scores between all token pairs
  • Multi-Head Attention: Multiple attention mechanisms in parallel
  • Feed-Forward Networks: Process each position independently
  • Positional Encoding: Adds position information to embeddings
  • Layer Normalization: Stabilizes training
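The core self-attention computation is softmax(QK^T / √d_k)V. A single-head, pure-Python sketch on nested lists, just to make the shapes concrete (real implementations use batched tensor operations):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head, on nested lists."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # score this query against every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d_k)
                  for key in K]
        weights = softmax(scores)
        # each output row is the attention-weighted sum of value vectors
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output
```

Every query attends to every key, which is why attention cost grows quadratically with sequence length.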

Interview Questions

Q: How to increase context length?

A: 1) Positional interpolation 2) ALiBi (Attention with Linear Biases) 3) Sparse attention patterns 4) Sliding window attention 5) Memory-augmented architectures

8. Supervised Fine-Tuning

Fine-Tuning Approaches

  • Full Fine-Tuning: Update all model weights (expensive)
  • LoRA: Low-Rank Adaptation - train small adapter matrices
  • QLoRA: Quantized LoRA for consumer hardware
  • Prefix Tuning: Learn task-specific prefixes
  • Adapter Layers: Insert trainable layers between frozen layers
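The LoRA idea is that the weight update is factored into two small matrices: W_eff = W + (α/r)·BA. A toy sketch on nested lists (shapes and function names are illustrative, not a training implementation):

```python
def matmul(A, B):
    """Multiply two matrices represented as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def lora_effective_weight(W, A, B, alpha):
    """W_eff = W + (alpha / r) * (B @ A), with r the LoRA rank.

    W is d_out x d_in and stays frozen; B (d_out x r) and A (r x d_in)
    are the only trained parameters: r * (d_out + d_in) values
    instead of d_out * d_in.
    """
    r = len(A)            # rank = number of rows of A
    delta = matmul(B, A)  # low-rank update, d_out x d_in
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]
```

For a 4096×4096 layer with r=8, that is roughly 65K trainable parameters instead of about 16.8M.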

Interview Questions

Q: What is catastrophic forgetting?

A: When fine-tuning on new data causes the model to forget previously learned knowledge. Mitigation: 1) Mix old and new data 2) Use LoRA 3) Regularization 4) Multi-task learning 5) Elastic Weight Consolidation

9. Preference Alignment (RLHF/DPO)

Alignment Methods

  • RLHF: Reinforcement Learning from Human Feedback - train a reward model, then optimize with PPO
  • DPO: Direct Preference Optimization - simpler, no reward model needed
  • RLAIF: Use AI feedback instead of human feedback
  • Constitutional AI: Self-critique and revision

10. LLM Evaluation

Evaluation Metrics

  • Perplexity: How well the model predicts the next token
  • BLEU/ROUGE: N-gram overlap with a reference
  • BERTScore: Semantic similarity using embeddings
  • Human Eval: Manual quality assessment
  • LLM-as-Judge: Use GPT-4 to evaluate outputs
  • Task-specific: Accuracy, F1, exact match for QA
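Exact match and token-level F1, the standard QA metrics, are straightforward to implement directly (the normalization here is a simplified lowercase/strip, not the full SQuAD evaluation script):

```python
def exact_match(prediction, reference):
    """1 if the normalized strings match exactly, else 0."""
    return int(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall against the reference."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    ref_counts = {}
    for t in ref:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in pred:
        if ref_counts.get(t, 0) > 0:  # match each reference token at most once
            common += 1
            ref_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit for overlapping answers, which exact match would score as 0.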

11. Hallucination Control

Control Techniques

  • RAG: Ground responses in retrieved facts
  • Prompt Engineering: Explicit instructions, citations
  • Confidence Scoring: Detect uncertain outputs
  • Fact Verification: Cross-check claims with sources
  • Chain of Verification: Generate, verify, refine
  • Constrained Decoding: Limit outputs to factual content

12. Deployment & Optimization

Optimization Techniques

  • Quantization: Reduce precision (INT8, INT4) with minimal accuracy loss
  • KV Cache: Cache key-value pairs for faster generation
  • Batching: Process multiple requests together
  • Speculative Decoding: Use a small model to draft, a large one to verify
  • Flash Attention: Memory-efficient attention computation
  • Model Distillation: Train a smaller model from a larger one
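Symmetric absmax quantization, the simplest INT8 scheme, maps each weight to round(w / scale) with scale = max|w| / 127. A toy sketch on a flat list of weights:

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats into [-127, 127] integers."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from INT8 values and the scale."""
    return [q * scale for q in quantized]
```

The largest-magnitude weight is recovered exactly; every other weight incurs at most half a quantization step of error, which is why outlier weights are the main practical problem for INT8.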

13. Agent-Based Systems

Agent Patterns

  • ReAct: Reasoning + Acting - a think, act, observe loop
  • Plan-and-Execute: Plan steps, then execute them
  • Function Calling: The LLM calls predefined tools/APIs
  • Multi-Agent: Multiple specialized agents collaborate
  • Reflexion: Self-reflection and learning from mistakes

14. Prompt Hacking & Security

Attack Types

  • Prompt Injection: Override system instructions
  • Jailbreaking: Bypass safety guardrails
  • Data Leakage: Extract training data

Defense Tactics

  • Input validation and sanitization
  • Separate system and user prompts
  • Output filtering and moderation
  • Rate limiting and monitoring

15. Case Studies & Scenarios

Common Interview Scenarios

Scenario: RAG system not accurate

Steps: 1) Evaluate retrieval quality 2) Improve chunking 3) Fine-tune embeddings 4) Add re-ranking 5) Use hybrid search 6) Optimize prompts

Scenario: High latency in production

Solutions: 1) Quantization 2) Smaller model 3) Caching 4) Batching 5) Async processing 6) Edge deployment

🚀 Hands-On: Build an LLM Chat Assistant with Dynamic Context

Now let's put theory into practice! We'll build a complete RAG-based chat assistant that dynamically retrieves relevant context based on user queries.

System Architecture

Our system will have:

  • Document Ingestion: Load and chunk documents
  • Vector Store: Store embeddings in Chroma
  • Retrieval: Find relevant docs for each query
  • Generation: The LLM generates a response with context
  • Chat Interface: Interactive conversation

Step 1: Setup & Dependencies

First, install required packages:

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install langchain langchain-openai chromadb python-dotenv tiktoken

Create a .env file:

OPENAI_API_KEY=your_api_key_here

Step 2: Complete Implementation

Create llm_chat_assistant.py:

import os
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_community.document_loaders import TextLoader, DirectoryLoader

# Load environment variables
load_dotenv()

class LLMChatAssistant:
    """RAG-based chat assistant with dynamic context retrieval"""
    
    def __init__(self, docs_path="./documents", persist_directory="./chroma_db"):
        self.docs_path = docs_path
        self.persist_directory = persist_directory
        self.embeddings = OpenAIEmbeddings()
        self.llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)
        self.vectorstore = None
        self.chain = None
        
    def load_documents(self):
        """Load documents from directory"""
        print("Loading documents...")
        loader = DirectoryLoader(
            self.docs_path,
            glob="**/*.txt",
            loader_cls=TextLoader
        )
        documents = loader.load()
        print(f"Loaded {len(documents)} documents")
        return documents
    
    def chunk_documents(self, documents):
        """Split documents into chunks"""
        print("Chunking documents...")
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50,
            separators=["\n\n", "\n", " ", ""]
        )
        chunks = text_splitter.split_documents(documents)
        print(f"Created {len(chunks)} chunks")
        return chunks
    
    def create_vectorstore(self, chunks):
        """Create vector store from chunks"""
        print("Creating vector store...")
        self.vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            persist_directory=self.persist_directory
        )
        print("Vector store created!")
    
    def setup_chain(self):
        """Setup conversational retrieval chain"""
        print("Setting up retrieval chain...")
        
        # Create memory for conversation history
        memory = ConversationBufferMemory(
            memory_key="chat_history",
            return_messages=True,
            output_key="answer"
        )
        
        # Create retrieval chain
        self.chain = ConversationalRetrievalChain.from_llm(
            llm=self.llm,
            retriever=self.vectorstore.as_retriever(
                search_kwargs={"k": 3}  # Retrieve top 3 chunks
            ),
            memory=memory,
            return_source_documents=True
        )
        print("Chain ready!")
    
    def initialize(self):
        """Initialize the complete system"""
        # Load and process documents
        documents = self.load_documents()
        chunks = self.chunk_documents(documents)
        
        # Create vector store
        self.create_vectorstore(chunks)
        
        # Setup chain
        self.setup_chain()
    
    def chat(self, query):
        """Process a query and return response"""
        if not self.chain:
            raise ValueError("System not initialized. Call initialize() first.")
        
        # Get response
        result = self.chain.invoke({"question": query})
        
        # Extract answer and sources
        answer = result["answer"]
        sources = result["source_documents"]
        
        return {
            "answer": answer,
            "sources": sources
        }
    
    def interactive_chat(self):
        """Start interactive chat session"""
        print("\n" + "="*50)
        print("LLM Chat Assistant Ready!")
        print("Type 'quit' to exit")
        print("="*50 + "\n")
        
        while True:
            # Get user input
            query = input("\nYou: ").strip()
            
            if query.lower() in ['quit', 'exit', 'q']:
                print("Goodbye!")
                break
            
            if not query:
                continue
            
            # Get response
            try:
                result = self.chat(query)
                
                # Display answer
                print(f"\nAssistant: {result['answer']}")
                
                # Display sources
                if result['sources']:
                    print(f"\n📚 Sources ({len(result['sources'])} documents):")
                    for i, doc in enumerate(result['sources'], 1):
                        preview = doc.page_content[:100].replace('\n', ' ')
                        print(f"  {i}. {preview}...")
                        
            except Exception as e:
                print(f"Error: {e}")

# Main execution
if __name__ == "__main__":
    # Create sample documents directory
    os.makedirs("./documents", exist_ok=True)
    
    # Create sample document if none exist
    sample_doc = "./documents/sample.txt"
    if not os.path.exists(sample_doc):
        with open(sample_doc, "w") as f:
            f.write("""
Large Language Models (LLMs) are neural networks trained on vast amounts of text data.
They use the Transformer architecture which relies on self-attention mechanisms.

RAG (Retrieval-Augmented Generation) combines retrieval with generation.
It retrieves relevant documents and includes them as context for the LLM.
This improves accuracy and allows access to up-to-date information.

Vector databases store embeddings and enable fast similarity search.
Popular options include Pinecone, Weaviate, and Chroma.
They use algorithms like HNSW for efficient nearest neighbor search.
            """)
    
    # Initialize and run
    assistant = LLMChatAssistant()
    assistant.initialize()
    assistant.interactive_chat()

Step 3: Run the Assistant

python llm_chat_assistant.py

Example Interaction:

You: What is RAG?