Document chunking is a critical preprocessing step in retrieval-augmented generation (RAG) systems that has received limited empirical evaluation. We present a systematic empirical analysis of adaptive structure-aware chunking across four diverse domains, six embedding models, and five chunk sizes (355 successful experiments across 360 planned configurations). Our adaptive method detects document structure, preserves sentence boundaries, and enriches chunks with section metadata. Contrary to initial smaller-scale experiments which suggested no clear advantage, this expanded evaluation reveals improvements across three of four tested domains: scientific literature (+0.21%), medical documents (+0.66%), and argumentative text (+0.05%), with financial discussions showing a slight negative result (−0.12%). The overall improvement is +0.20% nDCG@10. While individual gains are modest, the consistency across three diverse domains suggests that structure-aware preprocessing provides benefits when documents have clear structural organization. Across the four BEIR datasets evaluated, performance gains track corpus structure indicators (composite r = 0.85); we treat this correlation as exploratory given n=4 and use it to motivate a practitioner-facing heuristic for predicting when adaptive chunking will help. This empirical finding aligns with concurrent theoretical work on hierarchical semantic entropy (Zhong et al., 2026), which predicts that entropy rate increases with corpus semantic complexity. We also find that instruction-tuned embedding models benefit substantially more (E5-Large: +1.29%) than classical models (MPNet: −0.32%), suggesting metadata enrichment acts as a surrogate instruction prefix.
Metadata-as-prefix injection—prepending speaker, timestamp, topic, and structural annotations to conversation chunks before embedding—is an intuitive strategy for improving conversational memory retrieval. We present a comprehensive evaluation across seven embedding models, five metadata schemas, and three evaluation contexts totaling over 69,000 retrieval evaluations that reveals a striking divergence between benchmark and production performance. On the LoCoMo benchmark (1,986 questions, 5,882 turns), metadata enrichment produces large, statistically significant improvements across all seven models: the best combination (E5-large-v2 + Schema C) achieves MRR of 0.472, a +220% improvement over raw text (p < 10⁻¹⁵⁴, d = 0.77). All 28 schema-vs-baseline comparisons are significant after Bonferroni correction. However, production evaluation on real agent memory (120 queries, 817 chunks) reveals that the same enrichment strategy degrades retrieval: mxbai-embed-large shows −10% Recall@5 (p = 0.003). Embedding space analysis identifies the mechanism: metadata prefixes increase pairwise chunk similarity across all four tested models (all p < 0.001, Cohen's d = 1.08–1.88), collapsing the vector space and reducing discriminative power. This benchmark-production divergence carries a methodological warning for the field: metadata enrichment strategies must be validated on deployment-representative data, not benchmarks alone.
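The enrichment step itself is simple to sketch. The abstract doesn't spell out the five schemas, so the bracketed field format and the field names below are illustrative assumptions, not the evaluated Schema C:

```python
def enrich(chunk, fields=("speaker", "timestamp", "topic")):
    """Prepend bracketed metadata fields to the chunk text before embedding.

    NOTE: field names and the "[key: value]" format are hypothetical; the
    actual schemas used in the evaluation are not reproduced here.
    """
    prefix = " ".join(f"[{k}: {chunk[k]}]" for k in fields if k in chunk)
    return f"{prefix} {chunk['text']}".strip() if prefix else chunk["text"]
```

Because every enriched chunk shares the same boilerplate token pattern, this is also where the failure mode originates: the common prefix structure pulls all chunk embeddings closer together, which is exactly the pairwise-similarity increase the embedding-space analysis measures.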
**Adaptive (ours)**
- **Input:** raw text with potential structural markers (headings, paragraphs, lists)
- **Structure detection:** parse Markdown ATX headers (`#`, `##`), paragraph breaks, and list markers into a hierarchical map
- **Splitting:** split at the nearest sentence boundary within the budget (512 tokens ≈ 2,048 chars), preferring section breaks; 10% overlap
- **Metadata:** prefix each chunk with its section title (e.g. `Section: Related Work`); the prefix counts toward the budget
- **Output:** structure-aware chunks with metadata, ready for embedding
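A minimal sketch of the adaptive pipeline, assuming Markdown input; this is an illustration of the approach, not the exact implementation behind the experiments:

```python
import re

CHUNK_BUDGET = 2048           # 512 tokens -> ~2,048 chars
OVERLAP = CHUNK_BUDGET // 10  # 10% overlap carried between chunks

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")   # Markdown ATX headers
SENTENCE = re.compile(r"(?<=[.!?])\s+")     # naive sentence-boundary split

def adaptive_chunks(text):
    """Structure-aware chunking: prefer section breaks, then sentence
    boundaries; prefix each chunk with its section title, with the
    prefix counting toward the character budget."""
    chunks, buf = [], []
    section = None

    def flush():
        body = " ".join(buf).strip()
        if body:
            prefix = f"Section: {section}. " if section else ""
            chunks.append(prefix + body)

    for line in text.splitlines():
        m = HEADER.match(line)
        if m:                              # section break: always start a new chunk
            flush()
            buf, section = [], m.group(2)
            continue
        for sent in filter(None, SENTENCE.split(line)):
            prefix_len = len(f"Section: {section}. ") if section else 0
            if buf and prefix_len + len(" ".join(buf + [sent])) > CHUNK_BUDGET:
                flush()
                buf = [" ".join(buf)[-OVERLAP:]]   # 10% overlap from the tail
            buf.append(sent)
    flush()
    return chunks
```

The key design choice is the ordering of fallbacks: a header always forces a chunk boundary, and only within a section does the budget and sentence splitter take over.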
**Fixed (baseline)**
- **Input:** raw text
- **Splitting:** split at exact character positions (2,048 chars) with 10% overlap, ignoring all structure
- **Output:** raw chunks that may split mid-sentence or mid-word
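The fixed baseline reduces to a stride over character positions; a sketch:

```python
def fixed_chunks(text, size=2048, overlap_frac=0.10):
    """Split at exact character positions, ignoring all structure.

    With size=2048 and 10% overlap the step is 1,843 chars, so
    consecutive chunks share ~205 characters.
    """
    stride = max(1, int(size * (1 - overlap_frac)))
    return [text[i:i + size] for i in range(0, len(text), stride)]
```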
**Recursive (baseline)**
- **Input:** raw text
- **Splitting:** try paragraph breaks first, then sentences, then characters (LangChain `RecursiveCharacterTextSplitter`)
- **Output:** better boundary respect, but no metadata enrichment
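The recursive baseline can be approximated in a few lines without the LangChain dependency. This mimics the separator-fallback idea behind `RecursiveCharacterTextSplitter` in simplified form (the real splitter also merges pieces with configurable overlap):

```python
def recursive_chunks(text, size=2048, seps=("\n\n", "\n", " ", "")):
    """Try coarser separators first (paragraphs, lines, words),
    falling back to a hard character split as a last resort."""
    if len(text) <= size:
        return [text] if text else []
    sep, *rest = seps
    if sep == "":                          # last resort: hard character split
        return [text[i:i + size] for i in range(0, len(text), size)]
    chunks, buf = [], ""
    for part in text.split(sep):
        candidate = buf + sep + part if buf else part
        if len(candidate) <= size:
            buf = candidate                # greedily pack pieces into the budget
        else:
            if buf:
                chunks.append(buf)
            if len(part) > size:           # piece still too big: recurse deeper
                chunks.extend(recursive_chunks(part, size, tuple(rest)))
                buf = ""
            else:
                buf = part
    if buf:
        chunks.append(buf)
    return chunks
```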
Relative % change in nDCG@10 (adaptive vs fixed), averaged across chunk sizes. ■ Blue = adaptive wins, ■ Orange = fixed wins. Values include +/− signs for accessibility.
| Dataset | MPNet | BGE-Large | BGE-M3 | BGE-Small | E5-Large | GTE-Large | Mean |
|---|---|---|---|---|---|---|---|
| SciFact | +0.21% | −0.51% | −0.78% | +0.24% | +1.86% | +0.18% | +0.21% |
| NFCorpus | −0.55% | +0.17% | +0.51% | +0.71% | +2.64% | +0.45% | +0.66% |
| FiQA | −0.76% | +0.46% | N/A | −0.20% | −0.06% | +0.04% | −0.12% |
| ArguAna | −0.19% | +0.00% | −0.70% | −0.49% | +0.71% | +0.99% | +0.05% |
Averaged across all datasets and models. Adaptive advantage is largest at 256 tokens (+1.04%).
E5-Large (instruction-tuned) shows the strongest response to adaptive chunking, likely leveraging metadata prefixes as surrogate instructions.
| Model | Type | SciFact Δ | NFCorpus Δ | FiQA Δ | ArguAna Δ | Mean Δ |
|---|---|---|---|---|---|---|
| E5-Large | Instruction-tuned | ↑ +1.86% | ↑ +2.64% | ↓ −0.06% | ↑ +0.71% | ↑ +1.29% |
| GTE-Large | General | ↑ +0.18% | ↑ +0.45% | +0.04% | ↑ +0.99% | ↑ +0.41% |
| BGE-Small | Compact | ↑ +0.24% | ↑ +0.71% | ↓ −0.20% | ↓ −0.49% | +0.07% |
| BGE-Large | General | ↓ −0.51% | ↑ +0.17% | ↑ +0.46% | +0.00% | +0.03% |
| MPNet | Classic | ↑ +0.21% | ↓ −0.55% | ↓ −0.76% | ↓ −0.19% | ↓ −0.32% |
| BGE-M3 | Multilingual | ↓ −0.78% | ↑ +0.51% | N/A | ↓ −0.70% | ↓ −0.32% |
FiQA shows a split: nDCG@10 decreases while Recall and MRR improve, suggesting that adaptive chunking on informal content improves coverage but slightly hurts precision-weighted ranking.
Datasets with titles and longer documents benefit most from adaptive chunking. Pearson correlation between composite structure score and nDCG@10 delta: r = 0.85.
| Dataset | Title Coverage | Avg Doc Chars | Structure Score | nDCG@10 Δ |
|---|---|---|---|---|
| NFCorpus | 100% | 1,497 | 0.611 | ↑ +0.66% |
| SciFact | 100% | 1,401 | 0.599 | ↑ +0.21% |
| ArguAna | 31% | 1,007 | 0.299 | +0.05% |
| FiQA | 0% | 767 | 0.169 | ↓ −0.12% |
Key insight: Adaptive chunking benefits scale with document structure availability. Datasets with titles (metadata source) and longer documents (more chunking opportunities) see the largest gains. FiQA's 0% title availability leaves nothing for metadata enrichment to work with.
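This heuristic can be captured as a one-line decision rule. The thresholds below are illustrative choices that happen to be consistent with the four datasets in the table above; they are not fitted parameters, and with n=4 they should be treated as a starting point, not a validated predictor:

```python
def should_use_adaptive(title_fraction, avg_doc_chars):
    """Rule of thumb: adaptive chunking tends to help when a substantial
    share of documents carry titles (metadata source) and documents run
    past roughly 1,000 characters (more chunking opportunities).

    NOTE: thresholds (0.3, 1000) are illustrative assumptions."""
    return title_fraction >= 0.3 and avg_doc_chars >= 1000
```

Applied to the table: NFCorpus, SciFact, and ArguAna come out positive (matching their positive deltas), while FiQA, with no titles and short posts, comes out negative (matching its −0.12%).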
I'm an independent researcher focused on practical applications of machine learning, with particular interest in retrieval-augmented generation (RAG) systems, document processing, and AI tooling. My work combines empirical evaluation with hands-on system building.
My recent research examines how document preprocessing strategies affect retrieval performance across different domains. I believe in open, reproducible research and sharing both successes and failures to advance the field.
When I'm not researching, you might find me following Formula 1, experimenting with new AI models, or building tools to make research more efficient and accessible.
An expanded benchmark (v3) is designed and ready to execute: 8 BEIR datasets, 10 embedding models, ~1,600 experiments across four methods — Fixed, Recursive, Adaptive, and a new Adaptive-NoMeta ablation that isolates boundary-awareness from metadata injection. This includes per-query Wilcoxon signed-rank tests with bootstrap confidence intervals, document-level sensitivity analysis, and computational cost comparisons against neural chunking methods. I'm looking for collaborators, institutional partners, or compute sponsorship to scale this work.
If you work in information retrieval, RAG systems, or document processing and find this research interesting, I'd love to hear from you.
This site is powered by AI tools and automation, including content generation and system monitoring.