Data
VerifIA’s verification pipeline is entirely data‑driven. The dataset you supply becomes the original seed from which VerifIA derives novel, out‑of‑sample inputs. Evaluating your model on these derived samples lets VerifIA measure its generalizability beyond the labeled data by verifying consistency with your application’s domain knowledge.
1. Seed Data
- The original seed is never evaluated “as‑is.”
- It powers the generation of all verification inputs.
- You may include:
- Training data (examples already seen by the model)
- Validation or test data (often yield more challenging derived inputs, since those regions tend to be less covered by the model)
VerifIA offers three ways to attach your seed data:
from verifia.verification import RuleConsistencyVerifier
verifier = RuleConsistencyVerifier("domain_rules.yaml")
# Supported formats: CSV, Excel, JSON, Parquet, Feather, Pickle
verifier = verifier.on(data_fpath="path/to/data.csv")
Format | Extension |
---|---|
CSV | .csv |
Excel | .xls , .xlsx |
JSON | .json |
Parquet | .parquet |
Feather | .feather |
Pickle | .pkl |
If your data comes from SQL, NoSQL, or APIs, load it into a DataFrame and pass it directly:
To handle metadata (e.g., categorical features), wrap your DataFrame in Dataset
:
from verifia.context.data import Dataset
from verifia.verification import RuleConsistencyVerifier
# 1. Load raw data
import pandas as pd
df = pd.read_csv("path/to/data.csv")
# 2. Define metadata
target = "predicted_label"
features = ["feat_1", "feat_2", ...]
cat_features = ["cat_feat_1", "cat_feat_2"]
# 3. Create Dataset
ds = Dataset(df, target, features, cat_features)
# 4. Initialize verifier and attach Dataset
verifier = RuleConsistencyVerifier("domain_rules.yaml")
verifier = verifier.on(dataset=ds)
2. Derived Inputs
VerifIA applies your configured search strategy (population size, iterations) and your defined domain (variables' ranges,constraints, and rules) to the original seed—to explore new in‑domain datapoints. This approach probes beyond perfectly in‑distribution regions represented by labeled examples.
3. Out‑of‑Domain Removal
- Validate each seed row against its feature metadata (ranges, types, constraints).
- Remove rows with aberrant values (out‑of‑range, undefined) to prevent generation errors.
- Report includes the count of removed out-of-domain seeds.
Warning
A high out‑of‑domain count can indicate:
- Noisy or invalid seed data
- Incomplete or overly restrictive metadata
4. Misprediction Exclusion
To avoid biasing verification by known model errors, VerifIA excludes any seed row where the model already fails:
- Misclassification: predicted class ≠ ground truth
- High Regression Error: |prediction – ground truth| > allowed tolerance
High Misprediction Rate
- Issue: A high number of exclusions signals that the model needs further optimization before advanced behavioral tests.
- Action: Retrain or fine‑tune until mispredictions on validation/test examples drop to an acceptable level, then proceed with VerifIA's verifications.
The report will show the count of excluded, mispredicted seeds.
5. Seed Diversity
- A diverse, well‑dispersed seed lets VerifIA cover more of the input space.
Avoid sampling original seeds in narrow input space regions
- Visualize feature histograms, pairwise plots, or scatter plots to ensure broad coverage.
- Perform unsupervised clustering (e.g., k‑means, DBSCAN) or compute pairwise distance metrics.
6. Verification Budget & Runtime
- Each seed row spawns many derived trials (population size × iterations × rules).
- Adding seeds ≠ linear runtime—more seeds can dramatically increase run time.
- Balance seed diversity against your compute budget; tune search parameters accordingly.
Supported Data Types
- Current: Tabular data
- Future: Images, natural‑language text, audio, and beyond