Dataset

verifia.context.Dataset

Represents a dataset containing features and a target label. Provides row sampling and splitting, and supports filtering based on domain-specific criteria.

X property

Returns:

| Type | Description |
| --- | --- |
| `npt.NDArray` | NumPy array containing the feature data. |

cat_feature_idxs property

Return the list of indices corresponding to categorical features.

Returns:

| Type | Description |
| --- | --- |
| `List[int]` | A list of indices in self._feature_names for categorical features. |

cat_feature_names property

Return the list of categorical feature names.

Returns:

| Type | Description |
| --- | --- |
| `List[str]` | A list of categorical feature names. |

data property

Returns:

| Type | Description |
| --- | --- |
| `pd.DataFrame` | A copy of the internal DataFrame containing the dataset. |

n_samples property

Returns:

| Type | Description |
| --- | --- |
| `int` | The number of samples in the dataset. |

num_feature_idxs property

Return the list of indices corresponding to numeric (non-categorical) features.

Returns:

| Type | Description |
| --- | --- |
| `List[int]` | A list of indices in self._feature_names for numeric features. |

num_feature_names property

Return the list of numerical feature names in their original order.

Returns:

| Type | Description |
| --- | --- |
| `List[str]` | A list of feature names that are not categorical. |
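The name and index properties above describe a partition of the feature list into categorical and numeric features. As a minimal sketch of that relationship, the indices can be derived from the ordered feature names. The helper `partition_feature_indices` and the column names are hypothetical and are not part of verifia.

```python
from typing import List, Sequence, Tuple


def partition_feature_indices(
    feature_names: Sequence[str],
    cat_feature_names: Sequence[str],
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: positions of categorical and numeric features."""
    cat_set = set(cat_feature_names)
    # Indices refer to positions in the ordered feature list.
    cat_idxs = [i for i, name in enumerate(feature_names) if name in cat_set]
    # Numeric features are taken to be every feature that is not categorical.
    num_idxs = [i for i, name in enumerate(feature_names) if name not in cat_set]
    return cat_idxs, num_idxs


feature_names = ["age", "city", "income", "segment"]  # illustrative names
cat_idxs, num_idxs = partition_feature_indices(feature_names, ["city", "segment"])
num_names = [feature_names[i] for i in num_idxs]  # original order is preserved
```

Because both index lists point into the same ordered feature list, they can be used to slice columns of a feature array consistently.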

target_data property

Retrieve the target column from the dataset.

Returns:

| Type | Description |
| --- | --- |
| `pd.Series` | A Series corresponding to the target variable specified by self._target_name. |

y property

Returns:

| Type | Description |
| --- | --- |
| `npt.NDArray` | NumPy array containing the target (label) data. |
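To illustrate the shapes involved, here is a minimal sketch that builds equivalent arrays with plain pandas and NumPy rather than with the verifia class itself. It assumes that X holds the feature columns in feature order and that y holds the target column; the column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [25, 40, 31],
        "income": [30.0, 52.5, 41.0],
        "label": [0, 1, 0],
    }
)
feature_names = ["age", "income"]  # illustrative feature columns
target_name = "label"              # illustrative target column

# Assumed equivalent of X: one row per sample, one column per feature.
X = df[feature_names].to_numpy()
# Assumed equivalent of y: one label per sample.
y = df[target_name].to_numpy()
# Assumed equivalent of n_samples and len(dataset).
n_samples = len(df)
```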

__init__(df, target_name, feature_names=None, cat_feature_names=None)

Initializes the Dataset instance by validating the input DataFrame and inferring categorical features if not provided.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | `DataFrame` | Input DataFrame containing the data. | required |
| target_name | `str` | Name of the target (label) column. | required |
| feature_names | `Optional[Iterable[str]]` | Iterable of all feature names. If None, the feature names are inferred automatically from the columns of the provided DataFrame. | `None` |
| cat_feature_names | `Optional[Iterable[str]]` | Iterable of categorical feature names. If None, categorical features are inferred automatically as the columns in feature_names with dtype 'object'. | `None` |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the DataFrame does not contain all required columns. |
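The defaults and validation described for `__init__` can be sketched with plain pandas. This is an illustration under stated assumptions, not verifia's implementation. The helper `infer_feature_groups` is hypothetical, and the rule that inferred features are "every column except the target" is an assumption, since the documentation only says they are inferred from the DataFrame columns.

```python
from typing import Iterable, List, Optional, Tuple

import pandas as pd


def infer_feature_groups(
    df: pd.DataFrame,
    target_name: str,
    feature_names: Optional[Iterable[str]] = None,
    cat_feature_names: Optional[Iterable[str]] = None,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper mirroring the documented defaults (not verifia code)."""
    # Assumption: inferred features are every column except the target.
    if feature_names is None:
        features = [c for c in df.columns if c != target_name]
    else:
        features = list(feature_names)

    # Documented behavior: missing required columns raise ValueError.
    missing = [c for c in [target_name, *features] if c not in df.columns]
    if missing:
        raise ValueError(f"DataFrame is missing required columns: {missing}")

    # Documented default: categorical features are the 'object'-dtype features.
    if cat_feature_names is None:
        cats = [c for c in features if pd.api.types.is_object_dtype(df[c])]
    else:
        cats = list(cat_feature_names)
    return features, cats


df = pd.DataFrame(
    {
        "age": [25, 40, 31],
        # Explicit object dtype so the example does not depend on pandas' string inference.
        "city": pd.Series(["Paris", "Lyon", "Nice"], dtype="object"),
        "label": [0, 1, 0],
    }
)
features, cats = infer_feature_groups(df, target_name="label")
```

Passing cat_feature_names explicitly is the safer choice when categorical columns are stored as integers or as a dedicated string or category dtype, since a dtype-based rule would not pick them up.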

__len__()

Returns:

| Type | Description |
| --- | --- |
| `int` | The number of samples in the dataset. |

feature_data(enforced_data_type=False)

Retrieve the features subset from the dataset.

Returns:

| Type | Description |
| --- | --- |
| `pd.DataFrame` | A DataFrame containing only the columns specified in self._feature_names. |

sample(n_samples=None, prop_samples=None, replace=False, random_state=None)

Creates a new Dataset instance by sampling rows from the current dataset.

At least one of n_samples or prop_samples must be provided. If both are provided, n_samples takes precedence.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n_samples | `Optional[int]` | The exact number of samples to extract. Must be positive. | `None` |
| prop_samples | `Optional[float]` | Proportion of the total samples to extract (between 0 and 1). | `None` |
| replace | `bool` | Whether sampling is done with replacement. | `False` |
| random_state | `Optional[Union[int, Generator]]` | Seed or Generator for reproducible sampling. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Dataset` | A new Dataset instance containing the sampled data. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If neither n_samples nor prop_samples is provided, if a provided value is out of range, or if n_samples exceeds the number of available samples when sampling without replacement. |
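The argument handling described for `sample` can be sketched on a plain DataFrame. This is a hedged illustration, not verifia's code. The helper `sample_rows` is hypothetical, the accepted range for prop_samples is assumed to be 0 < p <= 1, and truncating the computed row count to an integer is also an assumption.

```python
from typing import Optional, Union

import numpy as np
import pandas as pd


def sample_rows(
    df: pd.DataFrame,
    n_samples: Optional[int] = None,
    prop_samples: Optional[float] = None,
    replace: bool = False,
    random_state: Optional[Union[int, np.random.Generator]] = None,
) -> pd.DataFrame:
    """Hypothetical helper mirroring the documented argument rules."""
    n_total = len(df)
    if n_samples is not None:
        # Documented: n_samples takes precedence when both are given.
        if n_samples <= 0:
            raise ValueError("n_samples must be positive")
        n = n_samples
    elif prop_samples is not None:
        # Assumption: valid proportions satisfy 0 < p <= 1.
        if not 0 < prop_samples <= 1:
            raise ValueError("prop_samples must be in (0, 1]")
        # Assumption: the row count is truncated to an integer.
        n = int(prop_samples * n_total)
    else:
        raise ValueError("provide n_samples or prop_samples")

    # Documented: cannot draw more rows than exist without replacement.
    if not replace and n > n_total:
        raise ValueError("n_samples exceeds available data")
    return df.sample(n=n, replace=replace, random_state=random_state)


df = pd.DataFrame({"x": range(10)})
by_count = sample_rows(df, n_samples=3, prop_samples=0.5, random_state=0)
by_prop = sample_rows(df, prop_samples=0.5, random_state=0)
oversampled = sample_rows(df, n_samples=11, replace=True, random_state=0)
```

Passing a fixed integer as random_state makes repeated calls return the same rows, which is useful when the sampled data feeds a reproducible experiment.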

split(primary_split_size, random_state=None)

Split the dataset into a primary split and a secondary split.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| primary_split_size | `float` | The fraction of the dataset to include in the primary split. | required |
| random_state | `Optional[int]` | Seed for reproducibility. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Tuple[Dataset, Dataset]` | A tuple containing a Dataset for the primary split, followed by a Dataset for the secondary split. |
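A two-way split of this kind can be sketched on a plain DataFrame as follows. This is one possible approach and not necessarily how verifia implements `split`. The helper `split_rows` is hypothetical, and it assumes the DataFrame index is unique so that the secondary split can be taken as the remaining rows.

```python
from typing import Optional, Tuple

import pandas as pd


def split_rows(
    df: pd.DataFrame,
    primary_split_size: float,
    random_state: Optional[int] = None,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Hypothetical helper: random primary/secondary split of the rows."""
    # Draw the primary split as a random fraction of the rows.
    primary = df.sample(frac=primary_split_size, random_state=random_state)
    # The secondary split is every row not drawn (requires a unique index).
    secondary = df.drop(primary.index)
    return primary, secondary


df = pd.DataFrame({"x": range(10), "label": [0, 1] * 5})
primary, secondary = split_rows(df, primary_split_size=0.8, random_state=42)
```

The two parts are disjoint and together cover every row, so a primary_split_size of 0.8 on 10 rows yields 8 and 2 rows respectively.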