Learn how to improve your valuation models with data enhancements
Enhancements are data transformations that can improve your regression model's accuracy and interpretability. They address common data issues like skewness, outliers, non-linear relationships, and scale differences.
The log transform applies the natural logarithm (ln) to all sale prices. This compresses high values more than low values, reducing right-skewness and stabilizing variance.
Scenario: You're valuing homes in a neighborhood where most properties sell for $200k-$300k, but a few luxury homes sell for $800k-$1.2M.
Without log transform: The model tries to fit both ranges with a straight line, resulting in poor predictions for typical homes (R² = 0.42).
With log transform: The model captures the percentage relationship between features and price, improving fit across all price ranges (R² = 0.78).
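The compression effect can be seen in a minimal sketch. The price list and the `skewness` helper below are hypothetical, chosen only to resemble the scenario above; they are not data or code from the product.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the average cubed z-score."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical sale prices: mostly $200k-$300k plus a few luxury sales.
prices = [210_000, 225_000, 240_000, 255_000, 270_000, 285_000, 300_000,
          800_000, 1_000_000, 1_200_000]

# The enhancement itself: natural log of every sale price.
log_prices = [math.log(p) for p in prices]

skew_raw = skewness(prices)
skew_log = skewness(log_prices)
print(f"skewness of raw prices: {skew_raw:.2f}")
print(f"skewness of ln(prices): {skew_log:.2f}")

# A value on the log scale is converted back to dollars with exp().
round_trip = math.exp(log_prices[0])
```

A model fitted to `log_prices` predicts ln(price), so its predictions need `math.exp` applied before they are read as dollar amounts.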
Outlier removal identifies and removes extreme data points using the IQR (interquartile range) method, where IQR = Q3 - Q1. Points below Q1 - 1.5×IQR or above Q3 + 1.5×IQR are excluded from the analysis.
Scenario: You're analyzing 25 comparable sales. One property sold for $150k (foreclosure, needs major repairs), while others range from $240k-$310k.
Without outlier removal: The $150k sale pulls the regression line down, undervaluing all typical properties (R² = 0.51, predictions $15k-$20k too low).
With outlier removal: The distressed sale is excluded, and the model accurately represents the typical market (R² = 0.82, predictions within $5k).
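The IQR rule can be sketched as follows. The sale prices are hypothetical, and the function name `remove_iqr_outliers` is an illustration rather than the product's API. Quartile conventions differ between tools; this sketch assumes the default method of Python's `statistics.quantiles`, so fence values may differ slightly elsewhere.

```python
import statistics

def remove_iqr_outliers(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    return kept, (low, high)

# Hypothetical data: 24 typical sales from $240k to $309k
# plus one distressed sale at $150k (25 comparables in total).
sales = [240_000 + 3_000 * i for i in range(24)] + [150_000]

kept, (low_fence, high_fence) = remove_iqr_outliers(sales)
removed = [v for v in sales if v not in kept]

print(f"fences: {low_fence:,.0f} to {high_fence:,.0f}")
print(f"removed: {removed}")
```

With this data, only the distressed sale falls outside the fences. A sale flagged this way still deserves a look before it is dropped, since an extreme price is not always an error.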
Polynomial terms add squared features (e.g., GLA², the square of gross living area) to the regression model. This allows the model to capture curved relationships instead of forcing everything into straight lines.
Scenario: You're valuing properties ranging from 1,000 to 4,000 sq ft. Small homes ($150/sq ft) and large homes ($110/sq ft) have different price-per-sq-ft values.
Without polynomial terms: The linear model assumes a constant $130/sq ft, undervaluing small homes (which sell for $150/sq ft) and overvaluing large ones (which sell for $110/sq ft) (R² = 0.65).
With polynomial terms: Model captures the curved relationship, accurately pricing homes across all sizes (R² = 0.84).
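As a rough sketch of the idea, the code below fits the same hypothetical sales twice by ordinary least squares, once with GLA alone and once with GLA and GLA². The data, the units (thousands of sq ft and thousands of dollars), and the helper names are assumptions for illustration. The small solver stands in for whatever regression routine is actually used.

```python
def fit_least_squares(rows, y):
    """Solve the normal equations (X'X)b = X'y by Gauss-Jordan elimination."""
    n = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

def r_squared(rows, y, coefs):
    preds = [sum(c * x for c, x in zip(coefs, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot

# Hypothetical sales: price per sq ft falls from $150 to $110 as size grows.
gla = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]        # thousands of sq ft
price = [150, 215, 273, 325, 370, 408, 440]      # thousands of dollars

linear_rows = [[1.0, g] for g in gla]             # intercept + GLA
quad_rows = [[1.0, g, g * g] for g in gla]        # intercept + GLA + GLA²

coefs_lin = fit_least_squares(linear_rows, price)
coefs_quad = fit_least_squares(quad_rows, price)

r2_lin = r_squared(linear_rows, price, coefs_lin)
r2_quad = r_squared(quad_rows, price, coefs_quad)
print(f"R² linear:    {r2_lin:.4f}")
print(f"R² quadratic: {r2_quad:.4f}")
print(f"GLA² coefficient: {coefs_quad[2]:.2f}")
```

A negative GLA² coefficient is how the model expresses diminishing returns: each additional square foot adds less value in a larger home.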
Standardization converts all features to the same scale (z-scores) by subtracting the mean and dividing by the standard deviation. This makes coefficients directly comparable.
Scenario: You want to know which matters more for price: property age or gross living area.
Without standardization: Coefficients are +$85/sq ft and -$1,200/year. These are hard to compare because the units differ.
With standardization: Coefficients are +$18,500 per standard deviation of GLA and -$8,200 per standard deviation of age. Now you can see that GLA has about 2.3× the impact of age on price.
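A minimal sketch of the z-score calculation follows. The feature values are hypothetical, and the sample standard deviation is an assumption, since the text does not say which variant is used. When only the features are standardized, each coefficient equals the raw coefficient multiplied by that feature's standard deviation, which is why the standardized values are all in dollars and can be compared.

```python
import statistics

def zscores(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical feature values for six comparable sales.
gla = [1450, 1600, 1750, 1900, 2050, 2200]   # sq ft
age = [5, 12, 20, 8, 30, 15]                 # years

gla_z = zscores(gla)
age_z = zscores(age)

# After standardization each feature has mean 0 and standard deviation 1.
print(f"GLA z-scores: mean={statistics.fmean(gla_z):.3f}, "
      f"sd={statistics.stdev(gla_z):.3f}")

# Rescaling a raw coefficient: dollars per unit -> dollars per standard deviation.
raw_coef_gla = 85.0                           # $/sq ft, from the scenario
std_coef_gla = raw_coef_gla * statistics.stdev(gla)

# Comparing the scenario's standardized coefficients by magnitude.
impact_ratio = 18_500 / 8_200
print(f"GLA vs. age impact: {impact_ratio:.1f}x")
```

Because the hypothetical `gla` values have their own spread, `std_coef_gla` here will not match the $18,500 figure in the scenario; only the rescaling rule carries over.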
| Enhancement | Primary Benefit | Typical R² Impact | Best For |
|---|---|---|---|
| Log Transform | Reduces skewness | +5% to +20% | Right-skewed prices |
| Remove Outliers | Removes noise | +3% to +15% | Data with errors/extremes |
| Polynomial Terms | Captures curves | +2% to +10% | Non-linear relationships |
| Standardize Features | Improves interpretability | 0% (no change) | Comparing feature importance |