Dataset
verifia.context.Dataset
Represents a dataset of features and a target label. Provides sampling and supports filtering based on domain-specific criteria.
X
property
Returns:
Type | Description |
---|---|
NDArray | NumPy array containing the feature data. |
cat_feature_idxs
property
Return the list of indices corresponding to categorical features.
Returns:
Type | Description |
---|---|
List[int] | A list of indices in self._feature_names for categorical features. |
cat_feature_names
property
Return the list of categorical feature names.
Returns:
Type | Description |
---|---|
List[str] | A list of categorical feature names. |
data
property
Returns:
Type | Description |
---|---|
DataFrame | A copy of the internal DataFrame containing the dataset. |
n_samples
property
Returns:
Type | Description |
---|---|
int | The number of samples in the dataset. |
num_feature_idxs
property
Return the list of indices corresponding to numeric (non-categorical) features.
Returns:
Type | Description |
---|---|
List[int] | A list of indices in self._feature_names for numeric features. |
num_feature_names
property
Return the list of numerical feature names in their original order.
Returns:
Type | Description |
---|---|
List[str] | A list of feature names that are not categorical. |
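The relationship between the four categorical and numeric properties can be sketched in plain Python. This is an illustration only, not the library's code: the helper `partition_feature_indices` and the column names are hypothetical, and the sketch assumes both index lists refer to positions in the full feature-name list.

```python
from typing import List, Sequence, Tuple


def partition_feature_indices(
    feature_names: Sequence[str], cat_feature_names: Sequence[str]
) -> Tuple[List[int], List[int]]:
    """Return (categorical indices, numeric indices) into feature_names."""
    cat_set = set(cat_feature_names)
    # Walk the features once per list so both results keep the original order.
    cat_idxs = [i for i, name in enumerate(feature_names) if name in cat_set]
    num_idxs = [i for i, name in enumerate(feature_names) if name not in cat_set]
    return cat_idxs, num_idxs


# Hypothetical feature names, used only for illustration.
feature_names = ["age", "city", "income", "segment"]
cat_idxs, num_idxs = partition_feature_indices(feature_names, ["city", "segment"])

# Numeric names are recovered from the indices, in their original order.
num_names = [feature_names[i] for i in num_idxs]
```

Under these assumptions, every feature index lands in exactly one of the two lists, so the categorical and numeric views together cover all features.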
target_data
property
Retrieve the target column from the dataset.
Returns:
Type | Description |
---|---|
Series | A Series corresponding to the target variable specified by self._target_name. |
y
property
Returns:
Type | Description |
---|---|
NDArray | NumPy array containing the target (label) data. |
__init__(df, target_name, feature_names=None, cat_feature_names=None)
Initializes the Dataset instance by validating the input DataFrame and inferring categorical features if not provided.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input DataFrame containing the data. | required |
target_name | str | Name of the target (label) column. | required |
feature_names | Optional[Iterable[str]] | Iterable of all feature names. If None, the feature names are inferred from the columns of the provided DataFrame. | None |
cat_feature_names | Optional[Iterable[str]] | Iterable of categorical feature names. If None, categorical features are inferred as the columns in feature_names with dtype 'object'. | None |
Raises:
Type | Description |
---|---|
ValueError | If the DataFrame does not contain all required columns. |
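The validation and inference rules described for the constructor can be sketched without any dependencies. This is a rough illustration, not the library's implementation: `resolve_feature_columns` is a hypothetical helper, a plain dict of column name to dtype string stands in for the DataFrame, and the sketch assumes that inferred features are all columns except the target.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def resolve_feature_columns(
    column_dtypes: Dict[str, str],
    target_name: str,
    feature_names: Optional[Iterable[str]] = None,
    cat_feature_names: Optional[Iterable[str]] = None,
) -> Tuple[List[str], List[str]]:
    """Return (feature names, categorical feature names) for a column/dtype map."""
    # Assumption: when no features are given, use every column except the target.
    if feature_names is None:
        features = [c for c in column_dtypes if c != target_name]
    else:
        features = list(feature_names)

    # Every feature column and the target column must be present.
    missing = [c for c in features + [target_name] if c not in column_dtypes]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    # When no categorical names are given, pick the 'object'-typed features.
    if cat_feature_names is None:
        cats = [c for c in features if column_dtypes[c] == "object"]
    else:
        cats = list(cat_feature_names)
    return features, cats


# Hypothetical columns, used only for illustration.
dtypes = {"age": "int64", "city": "object", "label": "int64"}
features, cats = resolve_feature_columns(dtypes, target_name="label")
```

Passing `cat_feature_names` explicitly is useful when a categorical column is stored with a non-object dtype, such as integer codes, since dtype-based inference would otherwise treat it as numeric.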
__len__()
Returns:
Type | Description |
---|---|
int | The number of samples in the dataset. |
feature_data(enforced_data_type=False)
Retrieve the features subset from the dataset.
Returns:
Type | Description |
---|---|
DataFrame | A DataFrame containing only the columns specified in self._feature_names. |
sample(n_samples=None, prop_samples=None, replace=False, random_state=None)
Creates a new Dataset instance by sampling rows from the current dataset.
Either n_samples or prop_samples must be provided. If both are provided, n_samples takes precedence.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_samples | Optional[int] | The exact number of samples to extract. Must be positive. | None |
prop_samples | Optional[float] | Proportion of the total samples to extract (between 0 and 1). | None |
replace | bool | Whether sampling is done with replacement. Defaults to False. | False |
random_state | Optional[Union[int, Generator]] | Seed or Generator for reproducible sampling. | None |
Returns:
Type | Description |
---|---|
Dataset | A new Dataset instance containing the sampled data. |
Raises:
Type | Description |
---|---|
ValueError | If neither n_samples nor prop_samples is provided. |
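The precedence rule between n_samples and prop_samples can be sketched as a small standalone function. This is not the library's code: `resolve_sample_count` is a hypothetical helper, and both the truncation of fractional counts and the treatment of the bounds on prop_samples are assumptions made for the sketch.

```python
from typing import Optional


def resolve_sample_count(
    total: int,
    n_samples: Optional[int] = None,
    prop_samples: Optional[float] = None,
) -> int:
    """Return how many rows to draw from a dataset of `total` rows."""
    # n_samples is checked first, so it wins when both arguments are given.
    if n_samples is not None:
        if n_samples <= 0:
            raise ValueError("n_samples must be positive")
        return n_samples
    if prop_samples is not None:
        # Assumption: 0 is excluded and 1 is allowed.
        if not 0 < prop_samples <= 1:
            raise ValueError("prop_samples must be in (0, 1]")
        # Assumption: fractional counts are truncated toward zero.
        return int(total * prop_samples)
    raise ValueError("Either n_samples or prop_samples must be provided")


count_from_prop = resolve_sample_count(200, prop_samples=0.25)
count_from_both = resolve_sample_count(200, n_samples=10, prop_samples=0.5)
```

Here `count_from_both` follows n_samples and ignores prop_samples, matching the documented precedence.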
split(primary_split_size, random_state=None)
Split the dataset into a primary split and a secondary split.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
primary_split_size | float | The fraction of the dataset to include in the primary split. | required |
random_state | Optional[int] | Seed for reproducibility. Defaults to None. | None |
Returns:
Type | Description |
---|---|
Tuple[Dataset, Dataset] | A tuple containing the Dataset for the primary split, followed by the Dataset for the secondary split. |
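A seeded two-way split can be sketched on row indices alone. This illustration is an assumption about the general technique, not a description of how the library splits: `split_indices` is a hypothetical helper, and the shuffle, the rounding of the primary size, and the open interval for primary_split_size are all choices made for the sketch.

```python
import random
from typing import List, Optional, Tuple


def split_indices(
    n_rows: int, primary_split_size: float, random_state: Optional[int] = None
) -> Tuple[List[int], List[int]]:
    """Shuffle row indices and cut them into (primary, secondary) parts."""
    # Assumption: the fraction must leave at least some room for both parts.
    if not 0 < primary_split_size < 1:
        raise ValueError("primary_split_size must be strictly between 0 and 1")
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible per seed.
    random.Random(random_state).shuffle(indices)
    # Assumption: the primary size is rounded to the nearest whole row.
    n_primary = int(round(n_rows * primary_split_size))
    return indices[:n_primary], indices[n_primary:]


primary, secondary = split_indices(10, 0.8, random_state=0)
primary_again, secondary_again = split_indices(10, 0.8, random_state=0)
```

The two parts are disjoint and together cover every row, and repeating the call with the same seed yields the same partition.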