Hierarchical classification is a machine learning task in which each instance is assigned to one or more classes organized in a hierarchy, rather than chosen from a flat label set. Exploiting this structure can improve prediction accuracy and makes outputs more interpretable.
The labels form a structured taxonomy with parent-child relationships. Instead of treating categories as independent, a hierarchical classifier models the relationships among them so that predictions better reflect the semantics of the data.
Types of Hierarchical Structures
1) Tree Hierarchy
- Each node has exactly one parent (except the root).
- Every instance is assigned a unique path from the root to its label, typically a leaf.
- Example: Animal → Mammal → Dog
2) DAG (Directed Acyclic Graph)
- A node can have multiple parents.
- Useful when concepts belong to multiple categories.
- Example: "Tablet" can belong to both "Electronics" and "Computing Devices"
3) Taxonomy
- A domain-specific organizational structure that can be a tree or DAG.
- Adds semantic meaning to the labels (e.g., product taxonomy in retail, medical coding in healthcare).
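The difference between a tree and a DAG can be made concrete with a small sketch. It stores each hierarchy as a mapping from a label to its parents, reusing the examples above. The names `TREE_PARENTS`, `DAG_PARENTS`, `ancestors`, and `is_tree_like` are illustrative assumptions, not part of any library.

```python
from typing import Dict, List, Set

# Hypothetical hierarchies: each label maps to the list of its parents.
# In a tree every node has at most one parent; in a DAG it may have several.
TREE_PARENTS: Dict[str, List[str]] = {
    "Animal": [],
    "Mammal": ["Animal"],
    "Dog": ["Mammal"],
}

DAG_PARENTS: Dict[str, List[str]] = {
    "Electronics": [],
    "Computing Devices": [],
    "Tablet": ["Electronics", "Computing Devices"],
}


def ancestors(label: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every ancestor of `label`, excluding the label itself."""
    seen: Set[str] = set()
    stack = list(parents[label])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def is_tree_like(parents: Dict[str, List[str]]) -> bool:
    """True if no node has more than one parent (a tree or a forest)."""
    return all(len(p) <= 1 for p in parents.values())


print(ancestors("Dog", TREE_PARENTS))
print(ancestors("Tablet", DAG_PARENTS))
print(is_tree_like(TREE_PARENTS), is_tree_like(DAG_PARENTS))
```

The same `ancestors` helper works for both structures, which is one reason a parent mapping is a convenient representation when the hierarchy might later grow from a tree into a DAG.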
Why Use Hierarchical Classification?
| Aspect | Flat Classification | Hierarchical Classification |
|---|---|---|
| Output | Single label | Label with hierarchy (e.g., path) |
| Error penalty | Equal for all errors | Penalizes mistakes at higher levels more |
| Interpretability | Moderate | High (provides structured output) |
Use Cases and Applications
- Medical diagnosis (ICD coding)
- Product categorization in e-commerce
- Document topic classification
- Biological classification (taxonomy)
- News categorization by topics and subtopics
Methods of Hierarchical Classification
1. Local Classifier per Node
- A binary classifier is trained for each node to decide whether an instance belongs to that class.
- Prediction proceeds top-down from the root.
2. Local Classifier per Parent Node
- For each internal node, a multi-class classifier is trained to distinguish among its child nodes.
- This requires fewer classifiers than the per-node approach, but each classifier has to solve a harder multi-class problem.
3. Local Classifier per Level
- One multi-class classifier is trained per hierarchy level, distinguishing among all nodes at that depth.
- Useful when the hierarchy is well-balanced.
4. Global Classifier
- A single model is trained to consider the full hierarchy.
- Often requires custom loss functions to enforce structural constraints.
5. Constraint-Based Models
- These models use the hierarchy during inference (and optionally training) to enforce logical constraints.
- Example: If a child node is predicted, all its ancestors must also be predicted.
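Two of the ideas above can be sketched in a few lines: top-down prediction with one classifier per parent node, and the constraint that a predicted child implies all of its ancestors. This is a minimal sketch under stated assumptions: the hierarchy `CHILDREN`, the feature names, and the hand-written scoring functions in `SCORERS` are hypothetical stand-ins for trained models.

```python
from typing import Callable, Dict, List, Set

# Hypothetical tree, stored as parent -> children.
CHILDREN: Dict[str, List[str]] = {
    "Animal": ["Mammal", "Bird"],
    "Mammal": ["Dog", "Cat"],
    "Bird": [],
    "Dog": [],
    "Cat": [],
}

Features = Dict[str, float]
Scorer = Callable[[Features], Dict[str, float]]

# Stand-ins for trained per-parent classifiers: each returns a score per child.
SCORERS: Dict[str, Scorer] = {
    "Animal": lambda x: {"Mammal": x["fur"], "Bird": x["feathers"]},
    "Mammal": lambda x: {"Dog": x["barks"], "Cat": 1.0 - x["barks"]},
}


def predict_top_down(
    x: Features,
    root: str,
    children: Dict[str, List[str]],
    scorers: Dict[str, Scorer],
) -> List[str]:
    """Walk from the root to a leaf, picking the best-scoring child each step."""
    path = [root]
    node = root
    while children[node]:
        scores = scorers[node](x)
        node = max(children[node], key=scores.__getitem__)
        path.append(node)
    return path


def close_under_ancestors(
    predicted: Set[str], children: Dict[str, List[str]]
) -> Set[str]:
    """Add every ancestor of each predicted label (tree hierarchies only)."""
    parent_of = {c: p for p, kids in children.items() for c in kids}
    closed = set(predicted)
    for label in predicted:
        node = label
        while node in parent_of:
            node = parent_of[node]
            closed.add(node)
    return closed


sample = {"fur": 0.9, "feathers": 0.1, "barks": 0.8}
print(predict_top_down(sample, "Animal", CHILDREN, SCORERS))
print(close_under_ancestors({"Dog"}, CHILDREN))
```

Because the walk commits to one child at each level, a wrong choice near the root cannot be corrected further down, which is the error-propagation weakness of top-down models.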
Hierarchical Cross-Entropy Loss
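To account for the hierarchical structure in the loss function, we can use a hierarchical cross-entropy loss that sums the log-loss over the true label and all of its ancestors. An error at a higher level lowers the probability of every node below it on the true path, so it tends to be penalized more heavily: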
L = -\sum_{i=1}^{N} \sum_{j \in \mathcal{A}(y_i)} \log P(j \mid x_i)
where:
- N is the number of training samples,
- y_i is the true label for instance x_i,
- \mathcal{A}(y_i) is the set of ancestors of y_i, including y_i itself,
- P(j \mid x_i) is the model's predicted probability that x_i belongs to node j.
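The loss above can be computed directly once the model outputs a probability for every node. The following is a minimal sketch in plain Python, assuming a tree stored as a child-to-parent mapping (`PARENT`) and per-instance dictionaries of node probabilities. Both are hypothetical, and the small `eps` clamp is an added safeguard against `log(0)`, not part of the formula.

```python
import math
from typing import Dict, List, Optional

# Hypothetical tree, stored as child -> parent (the root has parent None).
PARENT: Dict[str, Optional[str]] = {
    "Animal": None,
    "Mammal": "Animal",
    "Dog": "Mammal",
    "Cat": "Mammal",
}


def ancestors_inclusive(label: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the label followed by its ancestors up to the root: A(y)."""
    path = []
    node: Optional[str] = label
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def hierarchical_cross_entropy(
    node_probs: List[Dict[str, float]],
    labels: List[str],
    parent: Dict[str, Optional[str]],
    eps: float = 1e-12,
) -> float:
    """Sum of -log P(j | x_i) over every node j in A(y_i), for all samples."""
    total = 0.0
    for probs, y in zip(node_probs, labels):
        for j in ancestors_inclusive(y, parent):
            total -= math.log(max(probs[j], eps))
    return total


# One sample whose true label is "Dog".
probs = [{"Animal": 1.0, "Mammal": 0.5, "Dog": 0.25, "Cat": 0.25}]
loss = hierarchical_cross_entropy(probs, ["Dog"], PARENT)
print(loss)
```

For this sample the loss is -(log 1 + log 0.5 + log 0.25) = log 8, so low confidence at "Mammal" is charged in addition to low confidence at "Dog".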
Evaluation Metrics
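- Hierarchical Precision / Recall: Precision and recall computed over the predicted and true label sets after each is extended with all of its ancestors, so partially correct paths earn partial credit.
- H-loss: Charges a penalty at the first incorrect node along a path from the root; further errors in the subtree below that node are not counted again.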
- Path Accuracy: Accuracy of the entire predicted path.
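Hierarchical precision, hierarchical recall, and path accuracy can be sketched as follows. The sketch assumes a tree stored as a child-to-parent mapping (`PARENT`, hypothetical) and single-label predictions. It excludes the root from the ancestor sets because every path shares it; that is a common convention, but treat it as an assumption here.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical tree, stored as child -> parent (the root has parent None).
PARENT: Dict[str, Optional[str]] = {
    "Animal": None,
    "Mammal": "Animal",
    "Bird": "Animal",
    "Dog": "Mammal",
    "Cat": "Mammal",
}


def label_set(label: str, parent: Dict[str, Optional[str]]) -> Set[str]:
    """The label plus its ancestors, excluding the root."""
    nodes: Set[str] = set()
    node = label
    while parent[node] is not None:
        nodes.add(node)
        node = parent[node]
    return nodes


def hierarchical_precision_recall(
    y_true: List[str], y_pred: List[str], parent: Dict[str, Optional[str]]
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over ancestor-extended label sets."""
    overlap = pred_total = true_total = 0
    for t, p in zip(y_true, y_pred):
        t_set, p_set = label_set(t, parent), label_set(p, parent)
        overlap += len(t_set & p_set)
        pred_total += len(p_set)
        true_total += len(t_set)
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / true_total if true_total else 0.0
    return precision, recall


def path_accuracy(
    y_true: List[str], y_pred: List[str], parent: Dict[str, Optional[str]]
) -> float:
    """Fraction of instances whose whole predicted path matches the true path."""
    matches = sum(
        label_set(t, parent) == label_set(p, parent)
        for t, p in zip(y_true, y_pred)
    )
    return matches / len(y_true) if y_true else 0.0


y_true = ["Dog", "Dog"]
y_pred = ["Dog", "Cat"]
print(hierarchical_precision_recall(y_true, y_pred, PARENT))
print(path_accuracy(y_true, y_pred, PARENT))
```

Predicting "Cat" for a "Dog" still earns credit for the shared "Mammal" node under hierarchical precision and recall, while path accuracy counts the instance as fully wrong.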
Tools and Libraries
- scikit-multilearn for multi-label classification, which hierarchical multi-label setups can build on
- keras-han (for hierarchical attention networks)
- Custom architectures using PyTorch or TensorFlow
- Graph Neural Networks: To learn hierarchical embeddings over DAGs
Challenges
- Data sparsity at deeper levels of the hierarchy.
- Error propagation in top-down models.
- Scalability for large taxonomies.
- Imbalanced data due to uneven class distribution.