What Is FAMD? A Clear, Beginner-Friendly Guide
What FAMD stands for
FAMD = Factor Analysis of Mixed Data.
Purpose
FAMD is a dimension-reduction technique designed to handle datasets that contain both continuous (numeric) and categorical variables. It produces a low-dimensional representation that captures the main patterns and relationships across mixed-type features.
When to use it
- Your dataset mixes numeric and categorical variables.
- You want to visualize structure (clusters, gradients) in 2–3 dimensions.
- You need to reduce dimensionality before clustering or predictive modeling while preserving contributions from both variable types.
How it works (overview)
- Numeric variables are centered and scaled.
- Categorical variables are converted to indicator (dummy) variables, and each indicator column is divided by the square root of its category's proportion, so that each categorical variable contributes comparably to the analysis.
- A singular value decomposition (SVD) or equivalent eigen-decomposition is applied to the combined, weighted matrix to extract principal components (dimensions).
- The resulting components are interpreted similarly to PCA: coordinates for observations and loadings for variables, but adjusted to account for mixed types.
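The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the usual formulation (z-scored numeric columns, indicator columns divided by the square root of the category proportion and then centered, followed by an SVD), not the code used by FactoMineR or prince. The function name `famd` and the toy data are made up for this example.

```python
import numpy as np

def famd(numeric, categorical, n_components=2):
    """Illustrative FAMD sketch (hypothetical helper, not a library API).

    numeric: (n, p) array-like of continuous variables.
    categorical: list of length-n label sequences, one per categorical variable.
    Assumes every numeric column has non-zero variance.
    """
    numeric = np.asarray(numeric, dtype=float)
    n = numeric.shape[0]

    # Step 1: center and scale numeric variables (population standard deviation).
    blocks = [(numeric - numeric.mean(axis=0)) / numeric.std(axis=0)]

    # Step 2: indicator-encode each categorical variable, weight each column
    # by 1 / sqrt(category proportion), then center.
    for labels in categorical:
        labels = np.asarray(labels)
        levels = np.unique(labels)
        indicators = (labels[:, None] == levels[None, :]).astype(float)
        proportions = indicators.mean(axis=0)
        weighted = indicators / np.sqrt(proportions)
        blocks.append(weighted - weighted.mean(axis=0))

    # Step 3: SVD of the combined matrix.
    z = np.hstack(blocks)
    u, s, _ = np.linalg.svd(z, full_matrices=False)

    eigenvalues = s**2 / n   # variance carried by each dimension
    coords = u * s           # coordinates of the observations
    return coords[:, :n_components], eigenvalues

# Toy data: two numeric variables and one categorical variable with three levels.
numeric = [[1.0, 10.0], [2.0, 8.0], [3.0, 13.0],
           [4.0, 9.0], [5.0, 15.0], [6.0, 11.0]]
categorical = [["a", "b", "a", "c", "b", "c"]]

coords, eigenvalues = famd(numeric, categorical)
```

With this weighting, each numeric variable contributes one unit of variance and a categorical variable with K levels contributes K - 1, so the eigenvalues of the toy example sum to 2 + (3 - 1) = 4.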
Output and interpretation
- Individual coordinates: each observation gets coordinates on principal dimensions (useful for scatterplots, clustering).
- Variable contributions: numeric variables have loadings; categorical variables show category coordinates and contribution measures.
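- Explained variance: each dimension has an associated eigenvalue, and its share of the eigenvalue total is the proportion of variance (inertia) it explains. Interpret the percentages with caution: a categorical variable with K levels contributes K - 1 units of inertia, so the first dimensions often explain a smaller share than in a PCA on numeric data alone.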
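To make the eigenvalue reading concrete, here is a small sketch that turns eigenvalues into percentage and cumulative shares and picks how many dimensions reach a target. The eigenvalues are invented for illustration, and the 80% target is an arbitrary assumption, not a rule of the method.

```python
def explained_variance(eigenvalues):
    """Return (percent share per dimension, cumulative percent)."""
    total = sum(eigenvalues)
    shares = [100.0 * ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for share in shares:
        running += share
        cumulative.append(running)
    return shares, cumulative

def dimensions_needed(eigenvalues, target_percent=80.0):
    """Smallest number of leading dimensions whose cumulative share reaches the target."""
    _, cumulative = explained_variance(eigenvalues)
    for index, value in enumerate(cumulative, start=1):
        if value >= target_percent:
            return index
    return len(eigenvalues)

# Hypothetical eigenvalues, sorted from largest to smallest.
eigenvalues = [2.0, 1.0, 0.6, 0.4]
shares, cumulative = explained_variance(eigenvalues)
n_dims = dimensions_needed(eigenvalues, target_percent=80.0)
```

Here the first two dimensions reach about 75% of the total, so three are needed to pass the 80% target.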
Practical tips
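- Numeric variables are standardized as part of FAMD, so differing units or scales are handled by the method; standardize them yourself only if your implementation does not.
- Rare categories can dominate because the weighting inflates low-frequency indicator columns; consider combining rare levels.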
- Use biplots to visualize individuals and variable contributions together.
- Retain only the first few dimensions that explain substantial variance for downstream tasks.
- Implementations are available in R (FactoMineR::FAMD) and Python (the prince library).
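The tip about rare categories can be applied before running FAMD. The sketch below merges levels whose share falls under a threshold into a single catch-all level. The helper name `lump_rare_levels`, the "Other" label, and the 10% cutoff are assumptions chosen for illustration; the right threshold depends on the data.

```python
from collections import Counter

def lump_rare_levels(labels, min_share=0.05, other_label="Other"):
    """Replace levels whose relative frequency is below min_share with other_label."""
    counts = Counter(labels)
    n = len(labels)
    rare = {level for level, count in counts.items() if count / n < min_share}
    return [other_label if label in rare else label for label in labels]

# Toy variable: levels "C" and "D" each cover 5% of the observations.
diagnoses = ["A"] * 12 + ["B"] * 6 + ["C"] + ["D"]
lumped = lump_rare_levels(diagnoses, min_share=0.10)
level_counts = Counter(lumped)
```

After lumping, "C" and "D" are pooled into one "Other" level covering 10% of the observations, which reduces both the number of indicator columns and the weight given to very small groups.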
Example use cases
- Customer datasets with demographics (categorical) and spending (numeric).
- Survey data combining Likert scales and categorical responses.
- Medical records with lab values and diagnosis codes.
Quick workflow (steps)
- Clean data, handle missing values.
- Encode categorical variables (most FAMD implementations handle this internally).
- Standardize numeric variables (most FAMD implementations also do this internally).
- Run FAMD and inspect eigenvalues.
- Plot individuals on first two dimensions; examine variable contributions.
- Use coordinates for clustering or predictive models.
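As one possible take on the first workflow step, the pandas sketch below fills missing values and separates numeric from categorical columns before the data is handed to an FAMD implementation. The imputation rule (median for numeric columns, most frequent level for categorical ones) is an assumption made for this example, not a recommendation that fits every dataset, and `prepare_for_famd` is a hypothetical helper name.

```python
import pandas as pd

def prepare_for_famd(df):
    """Impute missing values and split columns by type (illustrative helper).

    Assumes every column has at least one observed value.
    """
    df = df.copy()
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    categorical_cols = [c for c in df.columns if c not in numeric_cols]

    # Numeric columns: fill gaps with the column median.
    for col in numeric_cols:
        df[col] = df[col].fillna(df[col].median())

    # Categorical columns: fill gaps with the most frequent level.
    for col in categorical_cols:
        df[col] = df[col].fillna(df[col].mode().iloc[0]).astype("category")

    return df, numeric_cols, categorical_cols

# Made-up customer data with one missing value per column.
raw = pd.DataFrame({
    "age": [34.0, None, 51.0, 29.0],
    "spend": [120.5, 80.0, None, 95.0],
    "segment": ["retail", "online", None, "online"],
})
clean, numeric_cols, categorical_cols = prepare_for_famd(raw)
```

The cleaned frame can then be passed to whichever FAMD implementation you use; the column lists are handy if that implementation needs the two variable types separately.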
Limitations
- Interpretation of mixed-variable variance is less straightforward than PCA.
- Sensitive to scaling and rare categories.
- Computational cost grows with many categories (high-dimensional dummy encoding).