Abstract
A pragmatic analysis of Retrieval-Augmented Generation applied to product catalogs: a production multilingual furniture system, a B2B configurator with business logic for a [Redacted] client, and a knowledge-graph document processor. One honest question: in 2026, with 1M+ token context windows, does building RAG pipelines still make sense?
1. The Uncomfortable Question: Does RAG Still Make Sense in 2026?
Let's put this on the table right away. In 2024, RAG was the answer to everything: models had 8-32K token context windows, per-token costs were high, and the only way to give an LLM domain-specific knowledge was to inject it via retrieval. It was a technical necessity, not an architectural choice.
In 2026, the landscape has shifted. Claude supports a 1 million token context window. GPT-5 handles 256K. Gemini 2.5 Pro reaches 1M. Per-token input cost has dropped by an order of magnitude since 2024. The legitimate question is: why not load the entire catalog into the prompt and be done with it?
The answer is not "because RAG is better." The answer is: it depends on the catalog, the use case, and how much you care about control over results.
I built three RAG systems for product catalogs over the past year. None of them would have worked better with simple context stuffing. Not because RAG is always the answer, but because these specific cases had characteristics that made it the right choice. In this paper, I honestly analyze when RAG adds value and when it's overengineering.
Bias check
I'm not a RAG evangelist. I've seen too many projects where a PostgreSQL full-text search would have solved the problem in an afternoon, but someone built a pipeline with embeddings, vector store, reranking and chunking, only to discover the catalog had 200 products.
2. Decision Framework: When RAG Makes Sense
Before describing the implementations, we need an honest framework for deciding whether RAG is the right choice. I distilled this from experience with these three projects and a couple others where I decided not to use it.
2.1 RAG is justified when:
- The catalog exceeds 500-1000 products. Below that threshold, context stuffing costs less in complexity than RAG costs in infrastructure
- Search is multilingual with domain-specific entities. General-purpose LLMs don't map "cromato satinato" to "satin chrome" without explicit support
- You need responses under 200ms. Context stuffing with 100K+ tokens adds significant latency to generation
- The catalog changes frequently. Re-indexing a vector store is incremental; rewriting prompts with the entire catalog is not
- You need traceability. Knowing exactly which product contributed to the response, with relevance scores
2.2 RAG is overengineering when:
- The catalog has fewer than 200-300 products. Load them into the prompt, done
- Queries are simple and monolingual. Full-text search with pg_trgm or Elasticsearch is sufficient
- You don't need generation, just retrieval. You're building a search engine, not a RAG system
- The budget doesn't justify the infrastructure. pgvector, embedding APIs, Redis cache all have operational costs
- The team lacks ML ops skills. A production RAG system requires embedding monitoring, drift detection, periodic re-indexing
| Scenario | Recommended approach | Why |
|---|---|---|
| < 200 products, monolingual | Context stuffing in prompt | Less complexity, negligible cost |
| 200-1000 products, simple queries | Full-text search (PostgreSQL) | Minimal infrastructure, excellent performance |
| 1000+ products, multilingual | RAG with hybrid search | Semantic matching + keyword, necessary for cross-language |
| Unstructured documents (PDFs, contracts) | RAG with knowledge graph | Entity relationships matter more than vector similarity |
| Products with complex configurations | RAG + business logic layer | Retrieval alone isn't enough, domain logic required |
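The decision table above can be sketched as a toy routing function. This is a minimal illustration, not a prescriptive rule engine: the thresholds come from this section, and all function and return-value names are hypothetical.

```python
def recommend_approach(n_products: int, multilingual: bool,
                       unstructured_docs: bool = False,
                       complex_config: bool = False) -> str:
    """Toy routing function mirroring the decision table above."""
    if unstructured_docs:
        return "rag_knowledge_graph"      # relationships > similarity
    if complex_config:
        return "rag_plus_business_logic"  # retrieval alone isn't enough
    if n_products < 200:
        return "context_stuffing"         # load it all into the prompt
    if n_products <= 1000 and not multilingual:
        return "full_text_search"         # pg_trgm / Elasticsearch suffices
    return "rag_hybrid_search"            # semantic + keyword matching
```

In practice the boundaries are fuzzy (budget and team skills matter too), but making the decision explicit in code forces the conversation the framework is meant to provoke.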
3. Implementation #1: Multilingual Furniture Catalog
3.1 The problem
A furniture catalog with products in Italian and English. Users search in one language, products have metadata in the other. "Divano angolare grigio" must find the product cataloged as "Corner sofa, grey." Full-text search fails here. Not because of PostgreSQL limitations, but because the words are different.
3.2 Architecture
User query (IT or EN)
|
Language Detection (langdetect)
|
Entity Extraction (60+ domain-specific mappings)
|
Query Translation (IT <> EN)
|
Embedding (Jina v3, 1024-dim, on EN translation)
|
+--------------------+
| Hybrid Search |
| Vector: 70% |
| Full-text: 30% |
| RRF k=60 |
+--------------------+
|
Ranked results with relevance scores

3.3 Technical choices and rationale
Embedding model: Jina v3 (1024 dimensions)
I chose Jina v3 for two reasons: native multilingual support (Italian and English with the same model) and 1024 dimensionality that balances semantic quality against storage and query costs on pgvector. With 10,000 products and an HNSW index, queries stay under 50ms.
Hybrid search: why 70/30 and not 50/50
The 70% vector / 30% keyword weight is not arbitrary. Reciprocal Rank Fusion with k=60 combines results from both search methods. In the furniture domain, semantic similarity matters more than exact matching because users describe products with vocabulary that differs from catalog terminology. "Reading armchair" contains no keywords from the product "Poltrona da lettura, tessuto, ergonomica." Only the vector connects them.
The 30% keyword component serves as a guardrail: when the user specifies a product code, an exact dimension, or a brand name, exact matching must win over semantic similarity.
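The weighted Reciprocal Rank Fusion described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: each input is a list of product IDs already ranked by one search method, and the 70/30 weights and k=60 are the values from this section.

```python
def rrf_fuse(vector_hits: list[str], keyword_hits: list[str],
             k: int = 60, w_vec: float = 0.7, w_kw: float = 0.3) -> list[str]:
    """Weighted Reciprocal Rank Fusion over two ranked result lists.

    Each result contributes weight / (k + rank); a product appearing in
    both lists accumulates both contributions.
    """
    scores: dict[str, float] = {}
    for rank, pid in enumerate(vector_hits, start=1):
        scores[pid] = scores.get(pid, 0.0) + w_vec / (k + rank)
    for rank, pid in enumerate(keyword_hits, start=1):
        scores[pid] = scores.get(pid, 0.0) + w_kw / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note how a product ranked first by both methods beats one ranked first by only the vector search: that is exactly the guardrail behavior the 30% keyword component provides.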
Entity extraction: 60+ domain mappings
This is the part that required the most manual work and produced the most significant improvement. I built a dictionary of 60+ furniture-specific terms with translations: divano=sofa, cromato=chrome, angolare=corner, bagno=bathroom. When a user searches "mobile bagno sospeso," the entity extractor identifies three entities (mobile=furniture, bagno=bathroom, sospeso=wall-mounted) and uses them to enrich the query.
Without this layer, cross-language search precision dropped by 30-35%. General-purpose embeddings don't map niche technical terminology with sufficient accuracy.
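The entity extraction layer is conceptually just a domain dictionary plus query enrichment. A minimal sketch, assuming a small illustrative subset of the 60+ mappings (the dictionary entries below are examples from this section, not the full production list):

```python
import re

# Illustrative subset of the 60+ IT -> EN furniture mappings
DOMAIN_ENTITIES = {
    "divano": "sofa", "poltrona": "armchair", "angolare": "corner",
    "cromato": "chrome", "satinato": "satin", "bagno": "bathroom",
    "mobile": "furniture", "sospeso": "wall-mounted", "grigio": "grey",
}

def extract_entities(query: str) -> dict[str, str]:
    """Return IT term -> EN translation for each known entity in the query."""
    tokens = re.findall(r"\w+", query.lower())
    return {t: DOMAIN_ENTITIES[t] for t in tokens if t in DOMAIN_ENTITIES}

def enrich_query(query: str) -> str:
    """Append English translations so both languages hit the index."""
    entities = extract_entities(query)
    return query + " " + " ".join(entities.values()) if entities else query
```

The enriched query then goes through translation and embedding as shown in the architecture diagram; the explicit mappings guarantee that niche terms survive even when the general-purpose translation step mangles them.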
Cache: Redis with 5-minute TTL
Queries in the furniture domain have a long-tail distribution with a concentrated head. "Sofa" and "table" cover 40% of searches. Caching with a 5-minute TTL reduces embedding API calls by 60% on a typical day, while keeping results fresh for catalog updates.
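The caching pattern is a standard cache-aside lookup keyed on the normalized query. A sketch, assuming a redis-py-compatible client (`get`/`setex`) and a hypothetical `embed_fn` wrapping the embedding API call:

```python
import hashlib
import json

def cached_embedding(query: str, embed_fn, cache, ttl_seconds: int = 300):
    """Cache-aside lookup: check Redis before calling the embedding API.

    `cache` needs only get/setex (redis-py compatible); `embed_fn` is a
    hypothetical wrapper around the embedding API returning a list of floats.
    """
    key = "emb:" + hashlib.sha256(query.lower().strip().encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)          # cache hit: no API call
    vector = embed_fn(query)
    cache.setex(key, ttl_seconds, json.dumps(vector))  # expire after TTL
    return vector
```

Normalizing the query (lowercase, stripped) before hashing is what lets "Sofa" and "sofa " share one cache entry, which matters when 40% of traffic concentrates on a handful of head queries.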
| Metric | Value | Notes |
|---|---|---|
| P50 latency | < 80ms | Includes embedding + hybrid search + ranking |
| P99 latency | < 200ms | With cache miss and full catalog |
| Cross-language precision | 85%+ | IT>EN and EN>IT on 200-query test set |
| Cache hit rate | ~60% | With 5-min TTL on typical traffic |
4. Implementation #2: B2B Catalog with Product Configuration
4.1 The problem
A B2B catalog where the product is not a single item but a modular configuration. Each product code decomposes into a component system where each segment represents a distinct part. Each component has dozens of finish variants with different pricing matrices. The catalog spans multiple collections with hundreds of possible combinations.
Classic retrieval is not enough here. You're not searching for "a chrome product." You're searching for a specific configuration with precise materials, finishes, and compatibility constraints. That's semantic search plus business logic.
4.2 Architecture
User query (product code or description)
|
RAG Agent > interprets product code
|
Component decomposition
|
For each component:
> Retrieve available variants
> Pricing matrix
> Compatibility check
|
Visual configurator
|
PDF quote generation

4.3 Key lesson: RAG is just the retrieval layer
This project clarified a point that most RAG tutorials skip: retrieval is just the first step. After finding the right product, a business logic layer is needed to handle variant compatibility, pricing matrices, and configuration rules.
The RAG agent interprets the query and finds the product in the catalog. But the final quote requires deterministic logic: certain finishes cost 40% more across all components, and not all variants are compatible with each other. This logic is not the LLM's job. It's code.
Common mistake: delegating business logic to the LLM because "it's easier." No. The model hallucinates on pricing, invents compatibility, and every error in a B2B quote is real economic damage. RAG finds the product. Code generates the quote.
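The "RAG finds, code decides" split can be made concrete with a toy quote function. All data below is hypothetical (component names, base prices, the finish list): the point is that pricing and compatibility live in deterministic code the LLM never touches.

```python
# Hypothetical pricing data: base prices per component, finish multipliers,
# and an explicit whitelist of valid (component, finish) pairs.
BASE_PRICE = {"body": 420.0, "door": 180.0, "handle": 35.0}
FINISH_MULTIPLIER = {"matte": 1.0, "chrome": 1.0, "satin_chrome": 1.4}  # +40%
COMPATIBLE = {
    ("body", "matte"), ("body", "satin_chrome"),
    ("door", "matte"), ("door", "satin_chrome"),
    ("handle", "chrome"), ("handle", "satin_chrome"),
}

def quote(configuration: dict[str, str]) -> float:
    """Deterministic quote: RAG found the product, code prices it."""
    total = 0.0
    for component, finish in configuration.items():
        if (component, finish) not in COMPATIBLE:
            raise ValueError(f"finish {finish!r} not available for {component!r}")
        total += BASE_PRICE[component] * FINISH_MULTIPLIER[finish]
    return round(total, 2)
```

An invalid combination raises an error instead of producing a plausible-looking number, which is exactly the failure mode you want in B2B quoting: loud and early, not hallucinated and billed.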
4.4 Data extraction: crawl4ai for finish data
A sub-problem was populating the database of available finishes per product. This information lived on the manufacturer's website, but not in a structured format. I used crawl4ai to scrape product pages and extract the product-to-finish mapping via regex on navigation links.
The result is a CSV mapping each product code to its available finishes. No LLM, no embeddings. Just a targeted scraper and a regex. Not everything needs to be AI.
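The extraction step itself reduces to a regex over the fetched HTML. A sketch, assuming a hypothetical URL scheme and product-code format on the manufacturer's site (the HTML fragment, link pattern, and codes below are all invented for illustration; the scraper fetch via crawl4ai is omitted):

```python
import re

# Hypothetical HTML fragment as the scraper might return it.
SAMPLE_HTML = """
<a href="/products/AB-100/finishes/chrome">Chrome</a>
<a href="/products/AB-100/finishes/satin-chrome">Satin chrome</a>
<a href="/products/CD-200/finishes/matte-black">Matte black</a>
"""

def extract_finishes(page_html: str) -> dict[str, list[str]]:
    """Regex over navigation links: product code -> available finishes."""
    pattern = re.compile(r'href="/products/([A-Z]{2}-\d+)/finishes/([a-z-]+)"')
    mapping: dict[str, list[str]] = {}
    for code, finish in pattern.findall(page_html):
        mapping.setdefault(code, []).append(finish)
    return mapping
```

Writing each mapping row out as CSV is then one `csv.writer` loop; the whole pipeline stays deterministic and auditable.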
5. Implementation #3: Unstructured Documents with Knowledge Graph
5.1 The problem
PDF documents, scanned images, and text files in various formats. Not a structured catalog but a heterogeneous archive: contracts, technical specs, reports, manuals. Six document types (legal, technical, financial, medical, academic, general) with different retrieval needs.
5.2 Why knowledge graph instead of vector search
For unstructured documents, vector similarity alone is insufficient. A contract mentioning "penalty of EUR 50,000 for delivery delay" and another stating "compensation clause: fifty thousand euros for timeline non-compliance" are semantically close. But the relationship that matters is that both refer to the same project with the same supplier.
LightRAG builds a knowledge graph from entities extracted from documents: companies, people, amounts, dates, clauses. Queries traverse the graph following relationships, not just vector similarity. "All contracts with penalties above EUR 10,000 for supplier X" requires graph traversal, not cosine similarity.
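To see why that query is traversal rather than similarity, here is a deliberately tiny in-memory graph (not LightRAG's actual storage format; entities, relation names, and amounts are all hypothetical). The query becomes edge filtering: follow `has_supplier` edges, then filter on `has_penalty_eur`.

```python
# Toy entity-relationship store: (entity, relation) -> value.
GRAPH = {
    ("contract_17", "has_supplier"): "supplier_x",
    ("contract_17", "has_penalty_eur"): 50_000,
    ("contract_23", "has_supplier"): "supplier_x",
    ("contract_23", "has_penalty_eur"): 8_000,
    ("contract_31", "has_supplier"): "supplier_y",
    ("contract_31", "has_penalty_eur"): 25_000,
}

def contracts_with_penalty_above(supplier: str, threshold: int) -> list[str]:
    """'All contracts with penalties above X for supplier S' as graph filtering."""
    contracts = {e for (e, rel), v in GRAPH.items()
                 if rel == "has_supplier" and v == supplier}
    return sorted(c for c in contracts
                  if GRAPH.get((c, "has_penalty_eur"), 0) > threshold)
```

No embedding of "penalty of EUR 50,000" will reliably encode the `> 10,000 AND supplier = X` constraint; the structured relations do it trivially.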
5.3 Stack and pipeline
| Phase | Technology | Detail |
|---|---|---|
| PDF text extraction | pypdf | Page by page, preserves structure |
| Image OCR | Pillow + Tesseract | For scanned documents |
| Knowledge graph | LightRAG | Automatic entity-relationship construction |
| LLM | GPT-4o Mini | Answer synthesis with graph context |
| Prompt engineering | Per document type | Tone and depth adaptation |
| UI | Streamlit | Fast prototype for validation |
5.4 Document-type adaptation
One aspect that significantly improved response quality: different prompts for different document types. When the system knows it's working with legal contracts, the prompt emphasizes terminological precision and clause citation. For technical documents, it emphasizes numerical specifications and tolerances.
This isn't a sophisticated feature. It's a parameter that changes the system prompt. But the difference in perceived quality is substantial. A legal document analyzed with a generic prompt produces vague answers. The same document with a legal-specific prompt produces precise citations with section references.
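Concretely, the whole mechanism is a dictionary lookup in front of prompt assembly. A minimal sketch (the prompt texts below are illustrative, not the production prompts):

```python
# Hypothetical per-document-type system prompts; one parameter swaps them.
SYSTEM_PROMPTS = {
    "legal": ("You are analyzing a legal contract. Cite clauses verbatim "
              "with section references; never paraphrase amounts or deadlines."),
    "technical": ("You are analyzing a technical specification. Report exact "
                  "numerical values, units, and tolerances."),
    "general": "Summarize the document clearly and concisely.",
}

def build_prompt(doc_type: str, question: str, graph_context: str) -> list[dict]:
    """Assemble chat messages, falling back to the general prompt."""
    system = SYSTEM_PROMPTS.get(doc_type, SYSTEM_PROMPTS["general"])
    return [
        {"role": "system", "content": system},
        {"role": "user",
         "content": f"Context:\n{graph_context}\n\nQuestion: {question}"},
    ]
```

The fallback to `general` matters: an unrecognized document type should degrade to vague-but-safe answers, not crash the pipeline.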
6. Comparison Across the Three Implementations
| Aspect | Furniture (pgvector) | B2B [Redacted] (RAG + logic) | Documents (LightRAG) |
|---|---|---|---|
| Retrieval approach | Hybrid vector + keyword | Semantic + business rules | Knowledge graph traversal |
| Embedding model | Jina v3 (1024-dim) | N/A (mockup) | OpenAI embeddings |
| Data type | Structured products | Complex configurations | Unstructured documents |
| Multilingual | Yes (IT/EN) | No | Yes (configurable) |
| Latency target | < 200ms | Interactive | Standard |
| Infrastructure complexity | Medium (pgvector + Redis) | High (RAG + business logic + PDF) | Low (LightRAG standalone) |
| LLM for generation | No (retrieval only) | Mockup | GPT-4o Mini |
| Main takeaway | Hybrid search > pure vector | RAG is not the whole solution | Graph > vector for relationships |
7. RAG's Real Competitor in 2026: Context Stuffing and Grounding
Let's address the elephant in the room. Google Gemini offers native grounding: connect your data store and the model searches on its own. OpenAI has file search built into Assistants. Claude with 1M context can ingest an entire catalog without chunking, embedding, or vector stores.
If I were a RAG enthusiast, I'd pretend these alternatives don't exist. Instead, I use them. And honestly, for certain use cases they're better.
7.1 When context stuffing wins
- Small catalog (< 500 products): load everything into the prompt, get immediate responses, zero infrastructure
- Prototyping: when validating an idea, a vector store is overhead that slows iteration
- Queries requiring reasoning over the entire catalog: "what's the cheapest product in each category" needs global view, not point retrieval
7.2 When RAG still wins
- Scale: with 10,000+ products, context stuffing gets expensive and slow. Token cost isn't negligible when you multiply by thousands of daily queries
- Precision: retrieval with scores tells you how confident the match is. Context stuffing doesn't
- Traceability: in a RAG system you know exactly which chunk generated the response. Critical for debugging and compliance
- Latency: searching a vector index is orders of magnitude faster than processing 100K+ tokens of context
- Incremental updates: add a product, update an embedding. Don't rebuild the entire prompt
- Privacy and control: data stays in your database, not transiting entirely through third-party APIs on every query
7.3 My position
RAG in 2026 is no longer a technical necessity. It's an architectural choice. The distinction matters. In 2024 you had no real alternatives. Context windows were too small. Today you have alternatives. Choose RAG when you need control, scale, and traceability. Choose context stuffing when you need simplicity and development speed.
The mistake I see most often is building a RAG system because "it's best practice." Without asking whether the use case justifies it. The symmetric mistake is dismissing RAG because "models have large context now," ignoring that cost, latency, and control don't scale linearly with context window size.
8. Patterns and Anti-Patterns from All Three Projects
8.1 Patterns that worked
- Always use hybrid search: pure vector search loses on queries with product codes, exact dimensions, or brand names. The 30% keyword component is an essential guardrail.
- Domain-specific entity extraction: the 60+ manual mappings improved precision more than any embedding optimization. The domain matters more than the algorithm.
- Separate retrieval from business logic: RAG finds, code decides. Never delegate calculations, compatibility checks, or pricing to the LLM.
- Aggressive caching on frequent queries: Zipf distribution means 20% of queries cover 60% of traffic.
- Document-type-specific prompts: a parameter that changes the system prompt improves perceived quality more than a better embedding model.
8.2 Anti-patterns to avoid
- RAG for small catalogs: below 200-300 products, context stuffing is simpler and equally effective.
- Aggressive chunking on structured data: a product is an atomic unit. Don't split it into chunks. Chunking is for long documents, not database records.
- Blindly trusting general-purpose embeddings for technical terminology: "cromato satinato" and "satin chrome" are far apart in a generic model's vector space.
- Ignoring cold start: the first query after deployment requires index warm-up. Plan for pre-heating.
- Delegating pricing and calculations to the model: a hallucination on a B2B quote is not a bug. It's economic damage.
9. Real Costs of a Production RAG System
An aspect tutorials rarely cover: how much it costs to maintain a RAG system in production. Not development cost, that's a one-time investment. Monthly operational cost for the furniture system, the most mature of the three:
| Component | Service | Estimated monthly cost |
|---|---|---|
| Vector DB | PostgreSQL + pgvector (managed) | ~$25-50/month |
| Embedding API | Jina v3 (pay-per-use) | ~$10-20/month (at ~50K queries/month) |
| Cache | Redis (managed) | ~$10-15/month |
| Compute | FastAPI on container | ~$15-30/month |
| Total | | ~$60-115/month |
For comparison, context stuffing the same catalog would cost roughly $0.10-0.15 per query (with a catalog of ~50K tokens). At 50K queries/month, that's $5,000-7,500. At 1K queries/month, it's $100-150: comparable to RAG cost but without infrastructure.
The break-even point is around 500-1,000 queries/month, depending on the model used for context stuffing. Below that threshold, context stuffing is probably cheaper when you factor in total cost (infrastructure + maintenance + monitoring). Above it, RAG becomes progressively more cost-effective.
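The break-even estimate above is simple arithmetic worth making explicit. A sketch using the midpoints from the table (RAG fixed cost ~$85/month, stuffing ~$0.125/query; these are this section's rough estimates, not benchmarked figures):

```python
def break_even_queries(rag_fixed_monthly: float = 85.0,
                       stuffing_cost_per_query: float = 0.125) -> int:
    """Queries/month above which RAG's fixed infra beats per-query stuffing.

    Ignores RAG's small per-query embedding cost, so this slightly
    understates the true break-even point.
    """
    return round(rag_fixed_monthly / stuffing_cost_per_query)

def monthly_stuffing_cost(queries: int, per_query: float = 0.125) -> float:
    """Context-stuffing cost scales linearly with query volume."""
    return queries * per_query
```

At the midpoints this gives ~680 queries/month, squarely inside the 500-1,000 range quoted above; varying the per-query cost across the $0.10-0.15 band shifts the break-even between roughly 570 and 850.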
10. What This All Means
Three RAG projects, three different domains, one common lesson: the technology is mature, but not universal. RAG is a tool, not a religion.
The furniture system proved that hybrid search with domain entity extraction produces results that pure vector search can't match. The B2B project proved that RAG is just the retrieval layer: business logic must be deterministic code, not prompts. The document system proved that for complex relationships, knowledge graphs go beyond what vector stores can handle.
In 2026, the question is no longer "should I use RAG?" but "which layer of my system needs semantic retrieval?" Answer that question honestly, measuring costs, complexity, and alternatives, and you'll have the right answer. Whether it's RAG, context stuffing, or PostgreSQL full-text search.
The most expensive mistake is not picking the wrong technology. It's picking the trendy technology without asking whether you need it.
Want to build something like this?
If you have a technical project requiring advanced AI architectures, let's talk.
