Dataset

`verifia.context.Dataset`

Represents a dataset containing features and a target label. Provides sampling and splitting functionality and supports filtering based on domain-specific criteria.

`X` `property`

Returns:

Type	Description
`NDArray`	NumPy array containing the feature data.

`cat_feature_idxs` `property`

Return the list of indices corresponding to categorical features.

Returns:

Type	Description
`List[int]`	A list of indices in `self._feature_names` for categorical features.

`cat_feature_names` `property`

Return the list of categorical feature names.

Returns:

Type	Description
`List[str]`	A list of categorical feature names.

`data` `property`

Returns:

Type	Description
`DataFrame`	A copy of the internal DataFrame containing the dataset.

`n_samples` `property`

Returns:

Type	Description
`int`	The number of samples in the dataset.

`num_feature_idxs` `property`

Return the list of indices corresponding to numeric (non-categorical) features.

Returns:

Type	Description
`List[int]`	A list of indices in `self._feature_names` for numeric features.

`num_feature_names` `property`

Return the list of numerical feature names in their original order.

Returns:

Type	Description
`List[str]`	A list of feature names that are not categorical.
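
The name and index properties above can be read as two views of one partition of the feature list. The sketch below illustrates that reading in plain Python. It is not verifia's implementation: `partition_features` is a hypothetical helper, and it assumes that indices are positions in the full feature list and that the original feature order is preserved.

```python
from typing import List, Sequence, Tuple


def partition_features(
    feature_names: Sequence[str], cat_feature_names: Sequence[str]
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: split feature positions into (categorical, numeric)."""
    cat = set(cat_feature_names)
    # Positions of categorical features, in original order.
    cat_idxs = [i for i, name in enumerate(feature_names) if name in cat]
    # Every remaining position is treated as numeric.
    num_idxs = [i for i, name in enumerate(feature_names) if name not in cat]
    return cat_idxs, num_idxs


features = ["age", "city", "income", "plan"]
cat_idxs, num_idxs = partition_features(features, ["city", "plan"])
num_names = [features[i] for i in num_idxs]
```

Under these assumptions every feature position lands in exactly one of the two index lists, so the lists can be used to slice the columns of a feature array without overlap.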

`target_data` `property`

Retrieve the target column from the dataset.

Returns:

Type	Description
`Series`	A Series corresponding to the target variable specified by `self._target_name`.

`y` `property`

Returns:

Type	Description
`NDArray`	NumPy array containing the target (label) data.

`__init__(df, target_name, feature_names=None, cat_feature_names=None)`

Initializes the Dataset instance by validating the input DataFrame and inferring categorical features if not provided.

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Input DataFrame containing the data.	required
`target_name`	`str`	Name of the target (label) column.	required
`feature_names`	`Optional[Iterable[str]]`	Iterable of all feature names. If None, all feature names are automatically inferred from the columns of the provided DataFrame.	`None`
`cat_feature_names`	`Optional[Iterable[str]]`	Iterable of categorical feature names. If None, categorical features are automatically inferred from columns in feature_names with dtype 'object'.	`None`

Raises:

Type	Description
`ValueError`	If the DataFrame does not contain all required columns.
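
To make the validation and inference rules concrete, the following sketch reproduces the documented behaviour with plain pandas. It is an illustration under assumptions, not verifia's code: `resolve_columns` is a hypothetical helper, and treating every non-target column as a feature when `feature_names` is None is an assumption that the description above does not state explicitly.

```python
from typing import Iterable, List, Optional, Tuple

import pandas as pd


def resolve_columns(
    df: pd.DataFrame,
    target_name: str,
    feature_names: Optional[Iterable[str]] = None,
    cat_feature_names: Optional[Iterable[str]] = None,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper mirroring the documented validation and inference."""
    if feature_names is None:
        # Assumption: every column except the target is a feature.
        features = [c for c in df.columns if c != target_name]
    else:
        features = list(feature_names)

    # The target and all features must exist in the DataFrame.
    missing = [c for c in [target_name, *features] if c not in df.columns]
    if missing:
        raise ValueError(f"DataFrame is missing required columns: {missing}")

    if cat_feature_names is None:
        # Documented rule: feature columns with dtype 'object' are categorical.
        cats = [c for c in features if df[c].dtype == object]
    else:
        cats = list(cat_feature_names)
    return features, cats


df = pd.DataFrame(
    {
        "age": [31, 45, 27],
        # dtype is set explicitly so the column is 'object' on any pandas version.
        "city": pd.Series(["Paris", "Lyon", "Nice"], dtype=object),
        "churn": [0, 1, 0],
    }
)
features, cats = resolve_columns(df, target_name="churn")
```

Because the inference relies on dtype alone, an integer-coded categorical column would be treated as numeric in this sketch; passing `cat_feature_names` explicitly avoids that.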

`__len__()`

Returns:

Type	Description
`int`	The number of samples in the dataset.

`feature_data(enforced_data_type=False)`

Retrieve the features subset from the dataset.

Returns:

Type	Description
`DataFrame`	A DataFrame containing only the columns specified in `self._feature_names`.

`sample(n_samples=None, prop_samples=None, replace=False, random_state=None)`

Creates a new Dataset instance by sampling rows from the current dataset.

Either n_samples or prop_samples must be provided. If both are provided, n_samples takes precedence.

Parameters:

Name	Type	Description	Default
`n_samples`	`Optional[int]`	The exact number of samples to extract. Must be positive.	`None`
`prop_samples`	`Optional[float]`	Proportion of the total samples to extract (between 0 and 1).	`None`
`replace`	`bool`	Whether sampling is done with replacement. Defaults to False.	`False`
`random_state`	`Optional[Union[int, Generator]]`	Seed or Generator for reproducible sampling.	`None`

Returns:

Type	Description
`Dataset`	A new Dataset instance containing the sampled data.

Raises:

Type	Description
`ValueError`	If neither `n_samples` nor `prop_samples` is provided, if provided values are out of range, or if sampling without replacement and `n_samples` exceeds available data.
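
The argument rules for `sample` (precedence of `n_samples`, range checks, and the without-replacement limit) can be sketched as a small pure-Python function. This is a hedged illustration, not verifia's implementation: `resolve_sample_size` is a hypothetical helper, and both the accepted range `0 < prop_samples <= 1` and the rounding-down of the resulting count are assumptions.

```python
from typing import Optional


def resolve_sample_size(
    total: int,
    n_samples: Optional[int] = None,
    prop_samples: Optional[float] = None,
    replace: bool = False,
) -> int:
    """Hypothetical helper: number of rows to draw from `total` rows."""
    if n_samples is None and prop_samples is None:
        raise ValueError("Provide n_samples or prop_samples.")
    if n_samples is not None:
        # n_samples takes precedence when both arguments are given.
        if n_samples <= 0:
            raise ValueError("n_samples must be positive.")
        size = n_samples
    else:
        # Assumption: valid range is 0 < prop_samples <= 1, rounded down.
        if not 0 < prop_samples <= 1:
            raise ValueError("prop_samples must be in (0, 1].")
        size = int(total * prop_samples)
    if not replace and size > total:
        raise ValueError("Cannot draw more rows than available without replacement.")
    return size


both_given = resolve_sample_size(100, n_samples=10, prop_samples=0.5)
by_proportion = resolve_sample_size(100, prop_samples=0.25)
oversampled = resolve_sample_size(10, n_samples=20, replace=True)
```

Resolving the size before touching any data keeps all three documented `ValueError` conditions in one place, ahead of the random draw.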

`split(primary_split_size, random_state=None)`

Split the dataset into a primary split and a secondary split.

Parameters:

Name	Type	Description	Default
`primary_split_size`	`float`	The fraction of the dataset to include in the primary split.	required
`random_state`	`Optional[int]`	Seed for reproducibility. Defaults to None.	`None`

Returns:

Type	Description
`Tuple[Dataset, Dataset]`	A tuple containing a Dataset for the primary split, followed by a Dataset for the secondary split.
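
One way to picture `split` is as a seeded shuffle of row positions followed by a single cut. The sketch below works on indices only and uses the standard library. It is an assumption-laden illustration rather than verifia's method: `split_indices` is a hypothetical helper, and the shuffling, the rounding-down of the cut point, and the requirement that `primary_split_size` lie strictly between 0 and 1 are all assumptions.

```python
import random
from typing import List, Optional, Tuple


def split_indices(
    n_rows: int, primary_split_size: float, random_state: Optional[int] = None
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: shuffle row positions and cut them into two parts."""
    # Assumption: the fraction must leave both splits non-trivial.
    if not 0 < primary_split_size < 1:
        raise ValueError("primary_split_size must be strictly between 0 and 1.")
    order = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible per seed.
    random.Random(random_state).shuffle(order)
    cut = int(n_rows * primary_split_size)  # assumption: round down
    return order[:cut], order[cut:]


primary, secondary = split_indices(10, 0.8, random_state=0)
```

In this sketch the two index lists are disjoint and together cover every row, and repeating the call with the same `random_state` yields the same partition.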
