datastream/text
Chunk-boundary-aware text stream operations on Stream(String).
Real text inputs arrive in arbitrary chunks (file reads, network
reads). A naïve string.split per chunk is wrong because line
terminators and delimiters land on chunk boundaries. The functions
here absorb that with a small, bounded amount of internal buffering
— they never materialise the full input.
Both functions consume Stream(String) rather than
Stream(BitArray). The byte → string boundary is owned by
text.utf8_decode (separate issue) so the cost lives in one place.
Values
pub fn lines(
over stream: datastream.Stream(String),
) -> datastream.Stream(String)
Split a stream of strings into a stream of lines, treating \n,
\r\n, and lone \r as terminators (matching Python’s
str.splitlines()).
The terminator is NOT included in emitted lines. A trailing partial line (input that does not end with a terminator) is emitted as a final element so callers do not silently drop the last record.
Chunk-boundary handling: when a \r lands at the end of one chunk
and the next chunk starts with \n, the pair is treated as a
single \r\n boundary. A lone \r with no following \n (either
because the next chunk does not start with \n, or because the
stream ended) is treated as a line terminator.
pub fn split(
over stream: datastream.Stream(String),
on delimiter: String,
) -> datastream.Stream(String)
Split a stream of strings on delimiter, emitting every separated
piece including the empty pieces between consecutive delimiters and
at the start / end of the input.
delimiter == "" triggers the grapheme-cluster splitter: each
emitted element is exactly one Unicode grapheme cluster of the
input, in source order.
Chunk-boundary handling: pieces that span a chunk boundary are joined; the trailing in-flight piece is held in a small buffer and emitted only when a delimiter or end-of-stream is reached.
pub fn utf8_decode(
over stream: datastream.Stream(BitArray),
) -> datastream.Stream(Result(String, Nil))
Decode a stream of UTF-8 byte chunks into a stream of decoded strings, reassembling multi-byte codepoints split across chunks.
Each successfully decoded codepoint surfaces as Ok(string). An
invalid byte (lead byte that is not a valid UTF-8 start, or a
missing/invalid continuation) emits exactly one Error(Nil) and
decoding resumes from the next byte. An incomplete trailing
sequence at end-of-stream emits a final Error(Nil) before halting.
The decoder holds a small internal buffer for partial multi-byte codepoints; it never materialises the full input.
pub fn utf8_encode(
over stream: datastream.Stream(String),
) -> datastream.Stream(BitArray)
Encode a stream of strings as UTF-8 byte chunks.
Each input string maps to its UTF-8 encoding as a single
BitArray element; an empty input string maps to the empty
BitArray.