Dataset library for the text classifier.
Inherits From: ClassificationDataset, Dataset
mediapipe_model_maker.text_classifier.Dataset(
    dataset: tf.data.Dataset,
    label_names: List[str],
    tfrecord_cache_files: Optional[cache_files_lib.TFRecordCacheFiles] = None,
    size: Optional[int] = None
)
| Attributes | |
|---|---|
| label_names | The names of the classification labels. | 
| num_classes | The number of classes, equal to len(label_names). | 
| size | Returns the size of the dataset. Same functionality as calling __len__; see the __len__ method definition for more information. | 
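A minimal sketch of wrapping an in-memory tf.data.Dataset directly; the texts, labels, and label names here are illustrative, not part of the API:

```python
import tensorflow as tf
from mediapipe_model_maker import text_classifier

# Illustrative in-memory data: two texts with integer labels.
raw_ds = tf.data.Dataset.from_tensor_slices((
    tf.constant(['great product', 'awful service']),
    tf.constant([1, 0]),
))

data = text_classifier.Dataset(
    dataset=raw_ds,
    label_names=['negative', 'positive'],  # index 0 -> negative, 1 -> positive
    size=2,
)
print(len(data), data.num_classes)  # -> 2 2
```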
Methods
from_csv
@classmethod
from_csv(
    filename: str,
    csv_params: mediapipe_model_maker.text_classifier.CSVParams,
    shuffle: bool = True,
    cache_dir: Optional[str] = None,
    num_shards: int = 1
) -> 'Dataset'
Loads text with labels from a CSV file.
| Args | |
|---|---|
| filename | Name of the CSV file. | 
| csv_params | Parameters used for reading the CSV file. | 
| shuffle | If True, randomly shuffle the data. | 
| cache_dir | Optional parameter to specify where to store the preprocessed dataset. Only used for BERT models. | 
| num_shards | Optional number of shards for the preprocessed dataset. Note that using more than one shard will reorder the dataset. Only used for BERT models. | 
| Returns | |
|---|---|
| Dataset containing (text, label) pairs and other related info. | 
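For example, loading a two-column CSV; the file path and column names below are assumptions about your data:

```python
from mediapipe_model_maker import text_classifier

# Assumed CSV layout: a "text" column and a "label" column; adjust to your file.
csv_params = text_classifier.CSVParams(text_column='text', label_column='label')
data = text_classifier.Dataset.from_csv(
    filename='reviews.csv',  # hypothetical path
    csv_params=csv_params,
    shuffle=True,
)
print(len(data), data.label_names)
```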
gen_tf_dataset
gen_tf_dataset(
    batch_size: int = 1,
    is_training: bool = False,
    shuffle: bool = False,
    preprocess: Optional[Callable[..., Any]] = None,
    drop_remainder: bool = False
) -> tf.data.Dataset
Generates a batched tf.data.Dataset for training/evaluation.
| Args | |
|---|---|
| batch_size | An integer, the returned dataset will be batched by this size. | 
| is_training | A boolean, when True, the returned dataset will be optionally shuffled and repeated as an endless dataset. | 
| shuffle | A boolean, when True, the returned dataset will be shuffled to create randomness during model training. | 
| preprocess | A function taking three positional arguments: the feature, the label, and a boolean is_training flag. | 
| drop_remainder | A boolean, whether to drop the final batch if it contains fewer than batch_size elements. | 
| Returns | |
|---|---|
| A TF dataset ready to be consumed by a Keras model. | 
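A usage sketch, assuming `data` is the Dataset loaded via from_csv above; the exact element structure of each batch depends on any preprocess function supplied:

```python
# Batch for training; is_training=True enables shuffle/repeat behavior.
tf_ds = data.gen_tf_dataset(
    batch_size=32,
    is_training=True,
    shuffle=True,
    drop_remainder=True,
)

# Peek at one batch; labels come out with shape (batch_size,).
for features, labels in tf_ds.take(1):
    print(labels.shape)
```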
split
split(
    fraction: float
) -> Tuple[ds._DatasetT, ds._DatasetT]
Splits the dataset into two sub-datasets with the given fraction.
Primarily used for splitting the dataset into training and testing sets.
| Args | |
|---|---|
| fraction | A float, the fraction of the original data that goes to the first returned sub-dataset. | 
| Returns | |
|---|---|
| The two sub-datasets resulting from the split. | 
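For example, to hold out 20% of the data for evaluation:

```python
# Assuming `data` is a Dataset as above: keep 80% for training.
train_data, validation_data = data.split(0.8)
print(len(train_data), len(validation_data))
```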
__len__
__len__() -> int
Returns the number of elements in the dataset.
If size is not set, this method falls back to calling len on the tf.data.Dataset in self._dataset. Calling len on a tf.data.Dataset instance may throw a TypeError because the dataset may be lazily loaded with an unknown size or have an infinite size.
In most cases, however, when an instance of this class is created by a helper function like from_csv, the size of the dataset is computed during preprocessing, so the _size instance variable will already be set.
| Raises | |
|---|---|
| TypeError | If self._size is not set and the cardinality of self._dataset is INFINITE_CARDINALITY or UNKNOWN_CARDINALITY. |
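Since len may raise, callers that cannot guarantee _size is set can guard the call:

```python
# len() may raise TypeError if _size was never set and the wrapped
# tf.data.Dataset has infinite or unknown cardinality.
try:
    num_examples = len(data)
except TypeError:
    num_examples = None  # size cannot be determined up front
```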