Text Extraction & Whitespace
Text Extraction
Handling text in a markup compiler is deceptively complex. Vyasac isolates this logic in src/utils.rs via the extract_text function.
The Problem
When extracting “text” from a node tree, we might want:
- Plain Text: For titles, metadata, or search indexing.
- HTML Fragment: For embedding in attributes or specific output contexts.
Additionally, concepts like SegmentBreak (newline) need different representations in each mode.
Implementation: utils::extract_text
The signature is:

    pub fn extract_text(nodes: &[Node], mode: &ExtractMode) -> String

ExtractMode

    pub enum ExtractMode {
        Plain, // Breaks become ' ', tags are stripped
        Html,  // Breaks become '<br />', tags are preserved?? (Actually assumes content)
    }

- Plain Mode: SegmentBreak converts to a space. This ensures that a multi-line title in source:

      `title {The
      Bhagavad Gita}

  becomes "The Bhagavad Gita" and not "TheBhagavad Gita".
- Html Mode: SegmentBreak converts to <br />.
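The two modes can be illustrated with a minimal sketch. The Node type below is hypothetical and heavily simplified (the real definition in Vyasac is not shown here), and the handling of commands, which keeps only their content in both modes, is an assumption based on the comments in ExtractMode rather than a description of the actual implementation.

```rust
// Hypothetical, simplified node type; the real `Node` in Vyasac may differ.
#[derive(Debug, Clone)]
pub enum Node {
    Text(String),
    SegmentBreak,
    // A command such as `title` with nested content (assumed shape).
    Command { name: String, children: Vec<Node> },
}

pub enum ExtractMode {
    Plain,
    Html,
}

pub fn extract_text(nodes: &[Node], mode: &ExtractMode) -> String {
    let mut out = String::new();
    for node in nodes {
        match node {
            Node::Text(text) => out.push_str(text),
            // The only mode-dependent step: how a segment break is rendered.
            Node::SegmentBreak => match mode {
                ExtractMode::Plain => out.push(' '),
                ExtractMode::Html => out.push_str("<br />"),
            },
            // Assumption: only the command's content is kept, in both modes.
            Node::Command { children, .. } => out.push_str(&extract_text(children, mode)),
        }
    }
    out
}
```

Under these assumptions, a title command containing the text "The", a SegmentBreak, and the text "Bhagavad Gita" yields "The Bhagavad Gita" in Plain mode and "The<br />Bhagavad Gita" in Html mode.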
Whitespace Stripping
By default, the compiler strips “insignificant” whitespace (whitespace-only text nodes that contain newlines) to keep the AST clean. Exceptions:
- Inline Whitespace: Whitespace without a newline is kept, e.g. the space in “Hello World”.
- Preserve Mode: Commands defined with whitespace="preserve" (e.g., code, pre) opt out of stripping.
This logic resides in src/pipeline.rs (strip_whitespace).
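As a rough sketch of the rule described above, the pass could look like the following. The Node type and the preserve flag are hypothetical stand-ins for the real AST and for commands defined with whitespace="preserve"; the actual strip_whitespace in src/pipeline.rs may be structured quite differently.

```rust
// Hypothetical, simplified node type; the real AST in Vyasac may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Text(String),
    // `preserve` stands in for a command defined with whitespace="preserve".
    Command { name: String, preserve: bool, children: Vec<Node> },
}

// "Insignificant" here means whitespace-only and containing a newline.
fn is_insignificant(text: &str) -> bool {
    text.contains('\n') && text.chars().all(char::is_whitespace)
}

pub fn strip_whitespace(nodes: Vec<Node>) -> Vec<Node> {
    nodes
        .into_iter()
        .filter_map(|node| match node {
            // Drop whitespace-only text nodes that contain a newline.
            Node::Text(text) if is_insignificant(&text) => None,
            // Recurse into ordinary commands.
            Node::Command { name, preserve: false, children } => Some(Node::Command {
                name,
                preserve: false,
                children: strip_whitespace(children),
            }),
            // Other text and preserve-mode commands are kept untouched.
            other => Some(other),
        })
        .collect()
}
```

In this sketch a text node holding a single space survives because it contains no newline, while a node holding only a newline and indentation is dropped unless it sits inside a preserve-mode command.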