chunk {iotools}    R Documentation

Functions for very fast chunk-wise processing

Description

chunk.reader creates a reader that reads from a binary connection in chunks while preserving the integrity of lines.

read.chunk reads the next chunk using the specified reader.

Usage

chunk.reader(source, max.line = 65536L, sep = NULL)
read.chunk(reader, max.size = 33554432L, timeout = Inf)

Arguments

source

binary connection or character string (which is interpreted as a file name) specifying the source

max.line

maximum length of one line (in bytes) - determines the size of the read buffer, default is 64 KiB (65536 bytes)

sep

optional string: key separator if key-aware chunking is to be used. The prefix of each line up to this character is considered a key, and subsequent records holding the same key are guaranteed to be processed in the same chunk.

reader

reader object as returned by chunk.reader

max.size

maximum size of the chunk (in bytes), default is 32 MiB (33554432 bytes)

timeout

numeric, timeout (in seconds) for reads if source is a raw file descriptor.

Details

chunk.reader is essentially a filter that converts a binary connection into chunks that can subsequently be parsed into data while preserving the integrity of input lines. read.chunk is used to read the actual chunks. The implementation is very thin to prevent copying of large vectors, for best efficiency.
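The reader/chunk loop described above can be sketched as follows. This is a minimal illustration, assuming the iotools package is installed; the temporary file and its contents are made up for the example. The loop condition also assumes that read.chunk returns a zero-length result (or NULL) once no more data is available, which this page does not state explicitly.

```r
library(iotools)

# Hypothetical sample input: 1000 short lines in a temporary file
path <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:1000), path)

# chunk.reader expects a binary connection
con <- file(path, "rb")
reader <- chunk.reader(con)

n.chunks <- 0L
total.bytes <- 0
# Assumption: a zero-length (or NULL) result means there is nothing more to read
while (length(chunk <- read.chunk(reader)) > 0) {
  n.chunks <- n.chunks + 1L
  total.bytes <- total.bytes + length(chunk)  # chunk is a raw vector
}
close(con)

cat(n.chunks, "chunk(s),", total.bytes, "bytes\n")
```

With an input this small and the default max.size, the whole file is likely to arrive in very few chunks; larger inputs would produce more iterations of the same loop.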

If sep is set to a string, it is treated as a single-character separator. In that case, the prefix of each input line up to that character is treated as a key, and subsequent lines with the same key are guaranteed to be processed in the same chunk. Note that this implies that the chunk size is practically unlimited, since satisfying this condition may force the accumulation of multiple chunks. This increases the processing and memory overhead.
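A small sketch of key-aware chunking, assuming the iotools package is installed. The tab-separated sample data and the variable names are hypothetical, and the loop assumes that read.chunk returns a zero-length result (or NULL) at the end of the input. The check at the end verifies that no key shows up in more than one chunk, which is the guarantee described above.

```r
library(iotools)

# Hypothetical input: "key<TAB>value" lines, with equal keys adjacent
path <- tempfile(fileext = ".tsv")
writeLines(c("a\t1", "a\t2", "b\t3", "b\t4", "c\t5"), path)

con <- file(path, "rb")
reader <- chunk.reader(con, sep = "\t")  # key = prefix of each line up to the tab

keys.per.chunk <- list()
# Assumption: a zero-length (or NULL) result means there is nothing more to read
while (length(chunk <- read.chunk(reader)) > 0) {
  m <- mstrsplit(chunk, sep = "\t")      # character matrix, one row per line
  keys.per.chunk[[length(keys.per.chunk) + 1L]] <- unique(m[, 1])
}
close(con)

# Each key should have been seen in exactly one chunk
all.keys <- unlist(keys.per.chunk)
stopifnot(!anyDuplicated(all.keys))
```

An input this small will most likely be delivered as a single chunk, so the check is trivially satisfied here; the same pattern applies to inputs that span many chunks.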

In addition to connections, chunk.reader supports raw file descriptors (integers of the class "fileDescriptor"). In that case the reads are performed directly by chunk.reader, and timeout can be used to perform non-blocking or timed reads (Unix only, not supported on Windows).

Value

chunk.reader returns an object that can be used by read.chunk. If source is a string, it is equivalent to calling chunk.reader(file(source, "rb"), ...).

read.chunk returns a raw vector holding the next chunk, or NULL if the timeout was reached. It is deliberate that read.chunk does NOT return a character vector, since that would result in a high performance penalty. Please use the appropriate parser to convert the chunk into data, see mstrsplit.
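As an illustration of converting a raw chunk into data, the following sketch passes the chunk directly to mstrsplit. It assumes the iotools package is installed; the comma-separated sample file is hypothetical, and the conversion to numeric via storage.mode is just one simple option for this example.

```r
library(iotools)

# Hypothetical input: three comma-separated rows
path <- tempfile(fileext = ".csv")
writeLines(c("1,2,3", "4,5,6", "7,8,9"), path)

con <- file(path, "rb")
reader <- chunk.reader(con)
chunk <- read.chunk(reader)        # raw vector, not a character vector
close(con)

# Parse the raw bytes into a character matrix (rows are split on "\n")
m <- mstrsplit(chunk, sep = ",")
storage.mode(m) <- "numeric"       # convert the parsed fields to numbers

print(dim(m))
print(colSums(m))
```

The chunk is never converted to a character vector in this sketch; the parser works on the raw bytes directly, in line with the note above.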

Author(s)

Simon Urbanek


[Package iotools version 0.3-3 Index]