Document chunking is a critical preprocessing step in retrieval-augmented generation (RAG) systems that has received limited empirical evaluation. We present a systematic empirical analysis of adaptive structure-aware chunking across four diverse domains, six embedding models, and five chunk sizes (355 successful experiments across 360 planned configurations). Our adaptive method detects document structure, preserves sentence boundaries, and enriches chunks with section metadata. Contrary to initial smaller-scale experiments which suggested no clear advantage, this expanded evaluation reveals improvements across three of four tested domains: scientific literature (+0.21%), medical documents (+0.66%), and argumentative text (+0.05%), with financial discussions showing a slight negative result (−0.12%). The overall improvement is +0.20% nDCG@10. While individual gains are modest, the consistency across three diverse domains suggests that structure-aware preprocessing provides benefits when documents have clear structural organization. Across the four BEIR datasets evaluated, performance gains track corpus structure indicators (composite r = 0.85); we treat this correlation as exploratory given n=4 and use it to motivate a practitioner-facing heuristic for predicting when adaptive chunking will help. This empirical finding aligns with concurrent theoretical work on hierarchical semantic entropy (Zhong et al., 2026), which predicts that entropy rate increases with corpus semantic complexity. We also find that instruction-tuned embedding models benefit substantially more (E5-Large: +1.29%) than classical models (MPNet: −0.32%), suggesting metadata enrichment acts as a surrogate instruction prefix.
Metadata-as-prefix injection—prepending speaker, timestamp, topic, and structural annotations to conversation chunks before embedding—is an intuitive strategy for improving conversational memory retrieval. We present a comprehensive evaluation across seven embedding models, five metadata schemas, and three evaluation contexts totaling over 69,000 retrieval evaluations that reveals a striking divergence between benchmark and production performance. On the LoCoMo benchmark (1,986 questions, 5,882 turns), metadata enrichment produces large, statistically significant improvements across all seven models: the best combination (E5-large-v2 + Schema C) achieves MRR of 0.472, a +220% improvement over raw text (p < 10⁻¹⁵⁴, d = 0.77). All 28 schema-vs-baseline comparisons are significant after Bonferroni correction. However, production evaluation on real agent memory (120 queries, 817 chunks) reveals that the same enrichment strategy degrades retrieval: mxbai-embed-large shows −10% Recall@5 (p = 0.003). Embedding space analysis identifies the mechanism: metadata prefixes increase pairwise chunk similarity across all four tested models (all p < 0.001, Cohen's d = 1.08–1.88), collapsing the vector space and reducing discriminative power. This benchmark-production divergence carries a methodological warning for the field: metadata enrichment strategies must be validated on deployment-representative data, not benchmarks alone.
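The enrichment step itself is simple to sketch. The abstract doesn't spell out the five schemas, so the bracketed field format and the field names below are illustrative assumptions, not the evaluated Schema C:

```python
def enrich(chunk, fields=("speaker", "timestamp", "topic")):
    """Prepend bracketed metadata fields to the chunk text before embedding.

    NOTE: field names and the "[key: value]" format are hypothetical; the
    actual schemas used in the evaluation are not reproduced here.
    """
    prefix = " ".join(f"[{k}: {chunk[k]}]" for k in fields if k in chunk)
    return f"{prefix} {chunk['text']}".strip() if prefix else chunk["text"]
```

Because every enriched chunk shares the same boilerplate token pattern, this is also where the failure mode originates: the common prefix structure pulls all chunk embeddings closer together, which is exactly the pairwise-similarity increase the embedding-space analysis measures.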
**Adaptive (ours)**
- **Input:** raw text with potential structural markers (headings, paragraphs, lists)
- **Structure detection:** parse Markdown ATX headers (`#`, `##`), paragraph breaks, and list markers into a hierarchical map
- **Splitting:** split at the nearest sentence boundary within the budget (512 tokens ≈ 2,048 chars), preferring section breaks; 10% overlap
- **Metadata:** prefix each chunk with its section title (e.g. `Section: Related Work`); the prefix counts toward the budget
- **Output:** structure-aware chunks with metadata, ready for embedding
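A minimal sketch of the adaptive pipeline, assuming Markdown input; this is an illustration of the approach, not the exact implementation behind the experiments:

```python
import re

CHUNK_BUDGET = 2048           # 512 tokens -> ~2,048 chars
OVERLAP = CHUNK_BUDGET // 10  # 10% overlap carried between chunks

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")   # Markdown ATX headers
SENTENCE = re.compile(r"(?<=[.!?])\s+")     # naive sentence-boundary split

def adaptive_chunks(text):
    """Structure-aware chunking: prefer section breaks, then sentence
    boundaries; prefix each chunk with its section title, with the
    prefix counting toward the character budget."""
    chunks, buf = [], []
    section = None

    def flush():
        body = " ".join(buf).strip()
        if body:
            prefix = f"Section: {section}. " if section else ""
            chunks.append(prefix + body)

    for line in text.splitlines():
        m = HEADER.match(line)
        if m:                              # section break: always start a new chunk
            flush()
            buf, section = [], m.group(2)
            continue
        for sent in filter(None, SENTENCE.split(line)):
            prefix_len = len(f"Section: {section}. ") if section else 0
            if buf and prefix_len + len(" ".join(buf + [sent])) > CHUNK_BUDGET:
                flush()
                buf = [" ".join(buf)[-OVERLAP:]]   # 10% overlap from the tail
            buf.append(sent)
    flush()
    return chunks
```

The key design choice is the ordering of fallbacks: a header always forces a chunk boundary, and only within a section does the budget and sentence splitter take over.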
**Fixed (baseline)**
- **Input:** raw text
- **Splitting:** split at exact character positions (2,048 chars) with 10% overlap, ignoring all structure
- **Output:** raw chunks that may split mid-sentence or mid-word
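The fixed baseline reduces to a stride over character positions; a sketch:

```python
def fixed_chunks(text, size=2048, overlap_frac=0.10):
    """Split at exact character positions, ignoring all structure.

    With size=2048 and 10% overlap the step is 1,843 chars, so
    consecutive chunks share ~205 characters.
    """
    stride = max(1, int(size * (1 - overlap_frac)))
    return [text[i:i + size] for i in range(0, len(text), stride)]
```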
**Recursive (baseline)**
- **Input:** raw text
- **Splitting:** try paragraph breaks first, then sentences, then characters (LangChain `RecursiveCharacterTextSplitter`)
- **Output:** better boundary respect, but no metadata enrichment
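The recursive baseline can be approximated in a few lines without the LangChain dependency. This mimics the separator-fallback idea behind `RecursiveCharacterTextSplitter` in simplified form (the real splitter also merges pieces with configurable overlap):

```python
def recursive_chunks(text, size=2048, seps=("\n\n", "\n", " ", "")):
    """Try coarser separators first (paragraphs, lines, words),
    falling back to a hard character split as a last resort."""
    if len(text) <= size:
        return [text] if text else []
    sep, *rest = seps
    if sep == "":                          # last resort: hard character split
        return [text[i:i + size] for i in range(0, len(text), size)]
    chunks, buf = [], ""
    for part in text.split(sep):
        candidate = buf + sep + part if buf else part
        if len(candidate) <= size:
            buf = candidate                # greedily pack pieces into the budget
        else:
            if buf:
                chunks.append(buf)
            if len(part) > size:           # piece still too big: recurse deeper
                chunks.extend(recursive_chunks(part, size, tuple(rest)))
                buf = ""
            else:
                buf = part
    if buf:
        chunks.append(buf)
    return chunks
```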
Relative % change in nDCG@10 (adaptive vs fixed), averaged across chunk sizes. ■ Blue = adaptive wins, ■ Orange = fixed wins. Values include +/− signs for accessibility.
| Dataset | MPNet | BGE-Large | BGE-M3 | BGE-Small | E5-Large | GTE-Large | Mean |
|---|---|---|---|---|---|---|---|
| SciFact | +0.21% | −0.51% | −0.78% | +0.24% | +1.86% | +0.18% | +0.21% |
| NFCorpus | −0.55% | +0.17% | +0.51% | +0.71% | +2.64% | +0.45% | +0.66% |
| FiQA | −0.76% | +0.46% | N/A | −0.20% | −0.06% | +0.04% | −0.12% |
| ArguAna | −0.19% | +0.00% | −0.70% | −0.49% | +0.71% | +0.99% | +0.05% |
Averaged across all datasets and models. Adaptive advantage is largest at 256 tokens (+1.04%).
E5-Large (instruction-tuned) shows the strongest response to adaptive chunking, likely leveraging metadata prefixes as surrogate instructions.
| Model | Type | SciFact Δ | NFCorpus Δ | FiQA Δ | ArguAna Δ | Mean Δ |
|---|---|---|---|---|---|---|
| E5-Large | Instruction-tuned | ↑ +1.86% | ↑ +2.64% | ↓ −0.06% | ↑ +0.71% | ↑ +1.29% |
| GTE-Large | General | ↑ +0.18% | ↑ +0.45% | +0.04% | ↑ +0.99% | ↑ +0.41% |
| BGE-Small | Compact | ↑ +0.24% | ↑ +0.71% | ↓ −0.20% | ↓ −0.49% | +0.07% |
| BGE-Large | General | ↓ −0.51% | ↑ +0.17% | ↑ +0.46% | +0.00% | +0.03% |
| MPNet | Classic | ↑ +0.21% | ↓ −0.55% | ↓ −0.76% | ↓ −0.19% | ↓ −0.32% |
| BGE-M3 | Multilingual | ↓ −0.78% | ↑ +0.51% | N/A | ↓ −0.70% | ↓ −0.32% |
FiQA shows a split: nDCG@10 decreases while Recall and MRR improve, suggesting that adaptive chunking on informal content improves coverage but slightly hurts precision-weighted ranking.
Datasets with titles and longer documents benefit most from adaptive chunking. Pearson correlation between composite structure score and nDCG@10 delta: r = 0.85.
| Dataset | Title Coverage | Avg Doc Chars | Structure Score | nDCG@10 Δ |
|---|---|---|---|---|
| NFCorpus | 100% | 1,497 | 0.611 | ↑ +0.66% |
| SciFact | 100% | 1,401 | 0.599 | ↑ +0.21% |
| ArguAna | 31% | 1,007 | 0.299 | +0.05% |
| FiQA | 0% | 767 | 0.169 | ↓ −0.12% |
Key insight: Adaptive chunking benefits scale with document structure availability. Datasets with titles (metadata source) and longer documents (more chunking opportunities) see the largest gains. FiQA's 0% title availability leaves nothing for metadata enrichment to work with.
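This heuristic can be captured as a one-line decision rule. The thresholds below are illustrative choices that happen to be consistent with the four datasets in the table above; they are not fitted parameters, and with n=4 they should be treated as a starting point, not a validated predictor:

```python
def should_use_adaptive(title_fraction, avg_doc_chars):
    """Rule of thumb: adaptive chunking tends to help when a substantial
    share of documents carry titles (metadata source) and documents run
    past roughly 1,000 characters (more chunking opportunities).

    NOTE: thresholds (0.3, 1000) are illustrative assumptions."""
    return title_fraction >= 0.3 and avg_doc_chars >= 1000
```

Applied to the table: NFCorpus, SciFact, and ArguAna come out positive (matching their positive deltas), while FiQA, with no titles and short posts, comes out negative (matching its −0.12%).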
I'm an independent researcher focused on practical applications of machine learning, with particular interest in retrieval-augmented generation (RAG) systems, document processing, and AI tooling. My work combines empirical evaluation with hands-on system building.
My recent research examines how document preprocessing strategies affect retrieval performance across different domains. I believe in open, reproducible research and sharing both successes and failures to advance the field.
When I'm not researching, you might find me following Formula 1, experimenting with new AI models, or building tools to make research more efficient and accessible.
An expanded benchmark (v3) is designed and ready to execute: 8 BEIR datasets, 10 embedding models, ~1,600 experiments across four methods — Fixed, Recursive, Adaptive, and a new Adaptive-NoMeta ablation that isolates boundary-awareness from metadata injection. This includes per-query Wilcoxon signed-rank tests with bootstrap confidence intervals, document-level sensitivity analysis, and computational cost comparisons against neural chunking methods. I'm looking for collaborators, institutional partners, or compute sponsorship to scale this work.
If you work in information retrieval, RAG systems, or document processing and find this research interesting, I'd love to hear from you.
This site is powered by AI tools and automation, including content generation and system monitoring.