Create an iterator for splitting binary or character input into a dataframe

idstrsplit takes a binary connection or character vector (which is interpreted as a file name) and splits it into a series of dataframes according to the separator.

idstrsplit(x, col_types, sep="|", nsep=NA, strict=TRUE, 
           max.line = 65536L, max.size = 33554432L)

Arguments

x

a binary connection or a character vector (interpreted as a file name) from which the input is read; newlines separate rows

col_types

required character vector or list giving the classes to assume for the columns of the output dataframe. If it is a list, class(x)[1] of each element x determines the class of the corresponding column. It is not recycled, and must be at least as long as the longest row if strict is TRUE.

Possible values are "NULL" (the column is skipped), one of the six atomic vector types ('character', 'numeric', 'logical', 'integer', 'complex', 'raw'), or 'POSIXct'. 'POSIXct' parses date-times of the form "YYYY-MM-DD hh:mm:ss.sss", assuming the GMT time zone. The separators between digits can be any non-digit characters, and only the date part is mandatory. See also fasttime::fastPOSIXct for details.

sep

single character: field (column) separator. Set to NA for no separator; in other words, a single column.

nsep

index name separator (single character) or NA if no index names are included

strict

logical; if FALSE, dstrsplit will not fail on parsing errors. Otherwise, input that does not match the format (e.g. more columns than expected) causes an error.

max.line

maximum length of one line (in bytes); determines the size of the read buffer. The default is 64 KiB.

max.size

maximum size of the chunk (in bytes). The default is 32 MiB.

Details

If nsep is specified, all characters up to (but excluding) the first occurrence of nsep are treated as the index name. The remaining characters are split into fields (columns) using the sep character. dstrsplit will fail with an error if any line contains more columns than expected, unless strict is FALSE, in which case the excess columns are ignored. Lines may contain fewer columns than expected, in which case the missing fields are set to NA.

Note that it is legal to use the same separator for sep and nsep in which case the first field is treated as a row name and subsequent fields as data columns.

If nsep is specified, the output of dstrsplit contains an extra column called 'rowindex' containing the row index. This is used instead of the rownames to allow for duplicated indices (which are checked for and not allowed in a dataframe, unlike the case with a matrix).
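
The splitting rules above can be illustrated with a small base-R sketch. It only mimics the documented behaviour for a single line and is not how the package implements parsing; split_line and its ncol argument are hypothetical names introduced for illustration.

```r
# Base-R illustration of the documented splitting rules.
# split_line() is a hypothetical helper, not part of iotools.
split_line <- function(line, ncol, sep = "|", nsep = NA, strict = TRUE) {
  index <- NA_character_
  if (!is.na(nsep)) {
    # Everything before the first occurrence of nsep is the index name.
    pos <- regexpr(nsep, line, fixed = TRUE)
    if (pos > 0L) {
      index <- substr(line, 1L, pos - 1L)
      line <- substr(line, pos + 1L, nchar(line))
    }
  }
  # The remainder is split into fields on sep.
  fields <- strsplit(line, sep, fixed = TRUE)[[1]]
  if (length(fields) > ncol) {
    if (strict) stop("line has more columns than expected")
    fields <- fields[seq_len(ncol)]  # excess columns are ignored
  }
  length(fields) <- ncol             # missing trailing fields become NA
  list(rowindex = index, fields = fields)
}

# Same separator for sep and nsep: the first field becomes the index name.
split_line("r1|a|b", ncol = 3, nsep = "|")

# Too many columns with strict = FALSE: the extra field is dropped.
split_line("a|b|c|d", ncol = 3, strict = FALSE)
```

In the first call the line has only two data fields for three expected columns, so the third field is NA.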

Value

idstrsplit returns an iterator (closure). When nextElem is called on the iterator, a data.frame is returned with as many rows as there are lines in the input chunk and as many columns as there are non-NULL values in col_types, plus an additional column if nsep is specified. The colnames (other than the row index) are set to 'V' concatenated with the column number, unless col_types is a named vector, in which case the names are inherited.
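
A minimal sketch of draining the iterator chunk by chunk is shown below. How the iterator signals that the input is exhausted is not stated on this page, so the loop assumes either an error (the iterators package convention is a "StopIteration" error) or an empty data.frame, and treats both as the end of input. The file name and its contents are made up for illustration.

```r
library(iotools)

# Write a small pipe-separated file to a temporary location.
path <- tempfile(fileext = ".txt")
writeLines(c("1|a", "2|b", "3|c"), path)

it <- idstrsplit(path, col_types = c("integer", "character"), sep = "|")

chunks <- list()
repeat {
  # Assumption: an exhausted iterator raises an error or returns an
  # empty data.frame; either case ends the loop.
  chunk <- tryCatch(it$nextElem(), error = function(e) NULL)
  if (is.null(chunk) || nrow(chunk) == 0L) break
  chunks[[length(chunks) + 1L]] <- chunk
}

# Combine all chunks into one data.frame.
result <- do.call(rbind, chunks)

# Remove the iterator (and its connection), then the file.
rm(it)
invisible(gc(FALSE))
unlink(path)
```

With the default max.size a file this small is expected to arrive as a single chunk; for larger inputs each call to nextElem yields at most max.size bytes of parsed lines.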

Author

Michael Kane

Examples

col_names <- names(iris)
write.csv(iris, file="iris.csv", row.names=FALSE)
it <- idstrsplit("iris.csv", col_types=c(rep("numeric", 4), "character"), 
                 sep=",")
# Get the first chunk; [-1,] drops the first row, which comes from
# the CSV header line written by write.csv
iris_read <- it$nextElem()[-1,]
# or with the iterators package
# nextElem(it)
names(iris_read) <- col_names
print(head(iris_read))
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width  Species
#> 2          5.1         3.5          1.4         0.2 "setosa"
#> 3          4.9         3.0          1.4         0.2 "setosa"
#> 4          4.7         3.2          1.3         0.2 "setosa"
#> 5          4.6         3.1          1.5         0.2 "setosa"
#> 6          5.0         3.6          1.4         0.2 "setosa"
#> 7          5.4         3.9          1.7         0.4 "setosa"

## remove iterator, connections and files
rm("it")
gc(FALSE)
#>           used (Mb) gc trigger  (Mb) max used  (Mb)
#> Ncells  922693 49.3    1780452  95.1  1780452  95.1
#> Vcells 1724734 13.2   36964543 282.1 46005749 351.0
unlink("iris.csv")