Flooding is a major planning challenge because it can damage infrastructure, disrupt transportation, threaten public safety, and create long-term recovery costs. In this project, a logistic regression model was used to estimate the probability that individual 60-meter grid cells would be inundated during a major flood event. The model was trained on observed flood data from Calgary’s 2013 flood and then applied to Salt Lake City as a comparable city. This approach is useful for planning because it uses environmental and built-form conditions to estimate relative flood exposure across space, even where future flood observations are not available.
The model relates a binary inundation outcome, flooded or not flooded, to predictors describing topography, hydrology, watershed processes, land cover, and the built environment. Logistic regression is appropriate because the outcome is binary and the results remain interpretable, making the method especially useful in a planning context where both prediction and explanation matter.
The results show that the model performed strongly in Calgary and produced plausible transferred predictions for Salt Lake City. The model achieved an AUC of 0.954, and the selected 0.35 threshold balanced improved recall with still-high overall accuracy and specificity. In both cities, the spatial pattern of predicted risk aligned with major drainage corridors and lower-lying areas. Overall, the analysis shows that this modeling approach is useful as a flood-screening tool for identifying broad areas of relative vulnerability and helping guide resilience planning.
The observed binary inundation target map shows the flood pattern that the model was trained to predict in Calgary. Cells coded as inundated are concentrated mainly along the Bow River corridor and connected flood-prone areas, while most of the surrounding study area is classified as not inundated. This spatial pattern confirms that flood exposure is highly uneven across the city and strongly tied to major river and drainage pathways.
To build the model, features were selected to represent several dimensions of flood vulnerability: topography, direct hydrologic exposure, watershed behavior, land cover, and the built environment. The 15 candidate features are listed in the table below. Four of them are visualized and explained in more detail: distance to nearest stream, agriculture ratio, maximum flow accumulation, and building cover ratio. Together, these variables reflect different but complementary aspects of flood processes.
| Category | Feature | Description | Data source |
|---|---|---|---|
| Topography | mean_elev | Average elevation within the fishnet cell. | DEM |
| | min_elev | Lowest elevation within the fishnet cell. | DEM |
| | elev_range | Difference between maximum and minimum elevation. | DEM |
| | mean_slope | Average slope within the fishnet cell. | DEM |
| | max_slope | Maximum slope within the fishnet cell. | DEM |
| | sd_elev | Standard deviation of elevation within the fishnet cell. | DEM |
| Hydrology | dist_nearest_stream | Distance from the cell centroid to the nearest stream. | OSM waterways |
| | water_cover_area | Area of water cover inside the cell. | Surface water raster |
| | river_density | Stream length per unit area within the cell. | OSM waterways |
| Watershed | max_flow_accum | Maximum flow accumulation within the cell. | DEM |
| Land cover | impervious_ratio | Share of the cell covered by impervious surfaces. | Impervious surface raster |
| | vegetation_ratio | Share of the cell covered by vegetation. | NLCD |
| | open_soil_ratio | Share of the cell covered by open soil or bare ground. | NLCD |
| | agriculture_ratio | Share of the cell covered by agricultural land. | NLCD |
| Built environment | building_cover_ratio | Share of the cell covered by building footprints. | OSM building footprints |
Distance to nearest stream is a useful feature because cells closer to mapped stream channels are more likely to be inundated, while cells farther away are less likely to flood. It is expected to be a negative predictor: predicted risk should fall as distance increases, because water tends to concentrate near channels and valley bottoms. On the map, darker cells are closer to streams and brighter cells are farther away, which makes the corridors of greatest likely exposure easy to see.
Agriculture ratio captures land-cover differences that may influence runoff and flood exposure. Agricultural land can reflect less built-up, more permeable surfaces, but it can also occur in low-lying areas that remain vulnerable to flooding, so its link to flooding is less direct than that of the stream-based features. On the map, darker cells have little agricultural land, while brighter cells show higher agricultural coverage, concentrated mainly along the outer edges of the study area.
Maximum flow accumulation shows how much upslope runoff drains toward each cell. Inundation depends both on local conditions and on how water collects and moves across the landscape. Cells with higher flow accumulation are more likely to lie along drainage paths or low areas where water concentrates. On the map, brighter lines show the main flow paths, while darker cells indicate much lower accumulated flow.
Building cover ratio was included to represent the built environment and the level of development within each grid cell. It is useful because areas with more building coverage often have more impervious surfaces, which can reduce infiltration and increase runoff. On the map, brighter cells show higher building coverage and darker cells show less built-up land, with development concentrated in clusters rather than evenly distributed across the study area.
The correlation matrix shows that some topography variables, especially elevation and slope-related variables, are strongly positively correlated with one another. This is expected because these features describe related aspects of topography. To make the model more robust and reduce the effects of multicollinearity, three highly correlated topographic variables were removed before fitting the final logistic regression. These excluded features were minimum elevation, mean slope, and standard deviation of elevation. The remaining twelve variables were retained for logistic modeling.
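The screening step described above can be sketched in a few lines. This is a minimal illustration rather than the project's actual code: the 0.8 cutoff, the function names, and the five-cell toy values are all assumptions introduced here, since the report does not state the threshold used to call two variables "highly correlated".

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlated_pairs(features, cutoff=0.8):
    """Return (name_a, name_b, r) for every feature pair with |r| >= cutoff.

    The cutoff of 0.8 is an assumed value, not one taken from the report.
    """
    names = list(features)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= cutoff:
                flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical values for five cells (illustration only, not project data).
toy = {
    "mean_elev": [1040.0, 1052.0, 1075.0, 1101.0, 1130.0],
    "min_elev": [1038.0, 1049.0, 1071.0, 1096.0, 1124.0],
    "building_cover_ratio": [0.30, 0.05, 0.22, 0.00, 0.12],
}
pairs = correlated_pairs(toy)
print(pairs)
```

In this toy example only the two elevation variables are flagged, which mirrors the kind of redundancy that motivated dropping one variable from each strongly correlated topographic pair.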
To estimate flood inundation probability, we fit a logistic regression model using Calgary as the training city. The dependent variable was binary, with each 60-meter fishnet cell coded as either inundated or not inundated based on the observed flood extent. Logistic regression was appropriate for this task because the goal was to predict the probability of a binary outcome rather than a continuous value. It also provides a transparent way to estimate the relationship between each feature and the inundation outcome while producing cell-level probability predictions.
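For reference, the standard form of a logistic regression can be written as follows. The notation is introduced here for illustration and does not appear in the original analysis: $y_i$ is 1 if cell $i$ was inundated and 0 otherwise, $x_{ij}$ is the value of predictor $j$ for cell $i$, $p$ is the number of predictors, and $\beta_0, \dots, \beta_p$ are the fitted coefficients.

```latex
\Pr(y_i = 1 \mid \mathbf{x}_i)
  = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right)\right)},
\qquad
\log\frac{\Pr(y_i = 1 \mid \mathbf{x}_i)}{1 - \Pr(y_i = 1 \mid \mathbf{x}_i)}
  = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
```

The second expression shows why the coefficients are interpretable: each $\beta_j$ is the change in the log-odds of inundation for a one-unit change in predictor $j$, holding the other predictors fixed.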
The modeling workflow involved cleaning the feature data, selecting the final predictors, standardizing the features, and splitting the Calgary data into training and test sets. The train-test split was important because it allowed model performance to be evaluated on unseen data rather than only on the same cells used to fit the model. In a planning context, this matters because the model is intended not just to explain the Calgary flood pattern, but also to support transfer of the same logic to Salt Lake City. Testing the model on held-out Calgary cells therefore provides a more realistic measure of how well the model generalizes.
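A minimal sketch of the split and standardization steps is shown below. Several details are assumptions made for illustration, because the report does not specify them: the 30% test share, the random seed, the choice to compute means and standard deviations from the training rows only, and the toy cells themselves.

```python
import random
import statistics

def train_test_split(rows, test_share=0.3, seed=42):
    """Shuffle row indices reproducibly and split rows into train and test lists.

    test_share and seed are assumed values, not taken from the report.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_share)
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test

def fit_scaler(rows, columns):
    """Compute (mean, sample standard deviation) per column from the given rows."""
    return {
        c: (statistics.mean(r[c] for r in rows), statistics.stdev(r[c] for r in rows))
        for c in columns
    }

def apply_scaler(rows, scaler):
    """Return copies of rows with each scaled column converted to a z-score."""
    return [
        {**r, **{c: (r[c] - mean) / sd for c, (mean, sd) in scaler.items()}}
        for r in rows
    ]

# Hypothetical cells (illustration only, not project data).
cells = [{"mean_elev": 1040.0 + 10 * i, "flooded": int(i < 2)} for i in range(10)]

train, test = train_test_split(cells)
scaler = fit_scaler(train, ["mean_elev"])  # statistics come from training rows only
train_z = apply_scaler(train, scaler)
test_z = apply_scaler(test, scaler)        # test rows reuse the training statistics
print(len(train_z), len(test_z))
```

Computing the scaling statistics on the training rows and reusing them for the held-out rows keeps the test set out of the fitting process, which is the property the evaluation relies on.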
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -7.0417579 | 0.0665171 | -105.8639302 | 0.0000000 |
| mean_elev | -2.4239484 | 0.0386969 | -62.6393278 | 0.0000000 |
| elev_range | -0.2779017 | 0.0367864 | -7.5544601 | 0.0000000 |
| max_slope | 0.0147673 | 0.0336994 | 0.4382063 | 0.6612368 |
| dist_nearest_stream | -3.3665298 | 0.0595080 | -56.5726998 | 0.0000000 |
| water_cover_area | 0.4549904 | 0.0113383 | 40.1285569 | 0.0000000 |
| river_density | -0.0943250 | 0.0087636 | -10.7632916 | 0.0000000 |
| max_flow_accum | 0.1235709 | 0.0126938 | 9.7347172 | 0.0000000 |
| impervious_ratio | -0.0029470 | 0.0189975 | -0.1551238 | 0.8767237 |
| vegetation_ratio | 0.0781022 | 0.0175655 | 4.4463417 | 0.0000087 |
| open_soil_ratio | -0.1546469 | 0.0244250 | -6.3314878 | 0.0000000 |
| agriculture_ratio | -0.8483197 | 0.0356294 | -23.8095154 | 0.0000000 |
| building_cover_ratio | -0.1174399 | 0.0177548 | -6.6145363 | 0.0000000 |
The final model results show that most predictors contributed meaningfully to the estimated probability of inundation; only max_slope (p = 0.66) and impervious_ratio (p = 0.88) were not statistically significant. Distance to nearest stream had a strong negative relationship with flooding, indicating that cells farther from mapped channels were less likely to be inundated. Maximum flow accumulation had a positive relationship with inundation, which matches the expectation that cells receiving more upslope drainage are more exposed to concentrated runoff and drainage flow. Agriculture ratio and building cover ratio were both negative in the final model, suggesting that their relationship to inundation is more complex than a simple runoff effect and is tied to the broader spatial structure of the study area. Overall, the model coefficients align reasonably well with the physical geography of riverine flooding, especially for variables connected to stream proximity, elevation, and watershed drainage.
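One way to read the coefficients is to convert them to odds ratios. The sketch below uses a subset of the estimates from the coefficient table; the variable names `coefficients`, `odds_ratios`, and `baseline_probability` are introduced here for illustration. It also assumes that the predictors were standardized before fitting, so that one unit corresponds to one standard deviation and the intercept describes a cell with average values for every predictor.

```python
import math

# Estimates copied from the coefficient table (log-odds scale).
coefficients = {
    "(Intercept)": -7.0417579,
    "mean_elev": -2.4239484,
    "dist_nearest_stream": -3.3665298,
    "max_flow_accum": 0.1235709,
    "agriculture_ratio": -0.8483197,
    "building_cover_ratio": -0.1174399,
}

# exp(beta) is the multiplicative change in the odds of inundation
# for a one-unit increase in the predictor, holding the others fixed.
odds_ratios = {
    name: math.exp(beta)
    for name, beta in coefficients.items()
    if name != "(Intercept)"
}

# Predicted probability when every predictor equals zero
# (the average cell, under the standardization assumption).
baseline_probability = 1.0 / (1.0 + math.exp(-coefficients["(Intercept)"]))

for name, ratio in sorted(odds_ratios.items(), key=lambda item: item[1]):
    print(f"{name:22s} odds ratio = {ratio:.3f}")
print(f"baseline probability = {baseline_probability:.5f}")
```

Under these assumptions, an odds ratio well below 1 (as for distance to nearest stream and mean elevation) means the odds of inundation drop sharply as the predictor increases, while a ratio slightly above 1 (as for maximum flow accumulation) indicates a modest increase.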
To evaluate model quality, we used the ROC curve, which summarizes how well the model separates inundated from non-inundated cells across all possible classification thresholds. The model produced an AUC of 0.954, which indicates excellent discrimination. In practical terms, this means the model usually assigns higher predicted probabilities to cells that were actually inundated than to those that were not.
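The practical reading of AUC can be made concrete with a small sketch: AUC equals the share of (inundated, non-inundated) cell pairs in which the inundated cell receives the higher predicted probability, with ties counted as half. The function name and the toy labels and scores below are hypothetical and are not project data.

```python
def auc_by_pairs(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = inundated) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.90, 0.60, 0.20, 0.70, 0.30, 0.10, 0.05, 0.02]

auc = auc_by_pairs(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of cells, so it is only suitable for explanation; on a full test set a rank-based or library implementation would be the practical choice.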
Because logistic regression produces probabilities rather than fixed classes, a classification threshold had to be chosen to convert predicted probabilities into inundated versus not inundated outcomes. We therefore compared several thresholds on the Calgary test split. This comparison showed the expected tradeoff. Lowering the threshold increased recall, meaning more inundated cells were identified (fewer false negatives), but it also reduced precision and specificity (more false positives). A threshold of 0.35 was selected as a reasonable balance because it improved recall compared with higher cutoffs while still maintaining strong overall accuracy and high specificity. This choice aligns with the planning purpose of the model, which is to identify potentially inundated urban areas.
Each test cell falls into one of four confusion-matrix categories. True positive (TP): the model predicted that a cell would flood, and it actually did flood. True negative (TN): the model predicted that a cell would not flood, and it actually did not flood. False positive (FP): the model predicted flooding, but the cell was not actually inundated. False negative (FN): the model predicted no flooding, but the cell was actually inundated.
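The four categories can be counted directly from labels and predicted probabilities. The sketch below is illustrative only: the function name, the toy data, and the convention that a probability equal to the threshold counts as a predicted flood are assumptions made here.

```python
def confusion_counts(labels, probabilities, threshold):
    """Count TP, TN, FP, FN after converting probabilities to classes at a threshold.

    A cell is predicted as inundated when its probability is >= threshold
    (an assumed convention).
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, probability in zip(labels, probabilities):
        predicted = int(probability >= threshold)
        if predicted == 1 and actual == 1:
            counts["TP"] += 1
        elif predicted == 0 and actual == 0:
            counts["TN"] += 1
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts

# Hypothetical labels (1 = inundated) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
probabilities = [0.90, 0.45, 0.20, 0.70, 0.38, 0.10, 0.05, 0.02]

at_050 = confusion_counts(labels, probabilities, 0.50)
at_035 = confusion_counts(labels, probabilities, 0.35)
print(at_050)
print(at_035)
```

In this toy example, lowering the threshold from 0.50 to 0.35 converts one false negative into a true positive and one true negative into a false positive, which is the same direction of tradeoff seen in the threshold comparison.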
Using the selected threshold of 0.35, the model achieved an accuracy of 0.956, a precision of 0.623, a recall of 0.508, an F1 score of 0.560, and a specificity of 0.982 on the Calgary test split. These results suggest that the model performs strongly overall, especially in identifying non-inundated cells. However, the high accuracy should be interpreted with caution because the classes are imbalanced: only about 5.5% of test cells (3,220 of 58,737) were inundated, so a model that labeled every cell as not inundated would still reach about 94.5% accuracy. Recall is also noticeably lower than specificity, which means the model still misses roughly half of the flooded cells even though its overall accuracy is high. This gap likely reflects the same imbalance, since the model is fit on far more non-inundated than inundated cells.
| Threshold | TP | TN | FP | FN | Accuracy | Precision | Recall | F1 | Specificity |
|---|---|---|---|---|---|---|---|---|---|
| 0.50 | 1291 | 55111 | 406 | 1929 | 0.9602465 | 0.7607543 | 0.4009317 | 0.5251169 | 0.9926869 |
| 0.40 | 1511 | 54794 | 723 | 1709 | 0.9585951 | 0.6763653 | 0.4692547 | 0.5540887 | 0.9869770 |
| 0.35 | 1637 | 54526 | 991 | 1583 | 0.9561775 | 0.6229072 | 0.5083851 | 0.5598495 | 0.9821496 |
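The metrics in the table follow directly from the four counts in each row. The sketch below recomputes them from the TP, TN, FP, and FN values above; the function name and the added `prevalence` field (the share of cells that were actually inundated) are introduced here for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive standard classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        "prevalence": (tp + fn) / total,  # share of cells actually inundated
    }

# (TP, TN, FP, FN) copied from the threshold table.
table = {
    0.50: (1291, 55111, 406, 1929),
    0.40: (1511, 54794, 723, 1709),
    0.35: (1637, 54526, 991, 1583),
}

metrics = {threshold: classification_metrics(*counts) for threshold, counts in table.items()}

for threshold, values in metrics.items():
    summary = ", ".join(f"{name}={value:.4f}" for name, value in values.items())
    print(f"{threshold:.2f}: {summary}")
```

The prevalence is the same in every row because the threshold changes the predictions, not the observed labels, and its low value is what makes accuracy alone a weak summary here.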
The test-set classification outcomes map shows where the model performs well and where errors remain. True positives are concentrated along the main river corridor and major drainage routes, showing that the model captures the broad flood pattern well. True negatives dominate the larger background area away from channels, which is consistent with the high specificity of the model. False negatives appear mainly near flood edges, suggesting some underprediction in transition zones, while false positives occur in a smaller number of cells, often near hydrologic pathways. Overall, the map suggests that the model reproduces the main flood structure reasonably well, with some local over- and under-prediction remaining.
The full predicted probability map for Calgary is shown below. Darker cells indicate lower predicted flood probability, while brighter cells indicate higher predicted flood probability. For planning purposes, this continuous surface is useful because it shows relative flood risk across the city.
Overall, the final logistic regression model performed well as a flood-screening tool for Calgary. Its strongest performance lies in distinguishing broad low-risk and high-risk areas, especially along major stream and drainage corridors. The very high AUC and strong specificity suggest that the model captures the general structure of flood exposure effectively.
However, the moderate recall also shows that some observed inundated cells are still missed, especially near the edges of flooded areas. This means the model should be understood as a useful planning approximation rather than a perfect representation of flood behavior at every location.
The Salt Lake City prediction maps show how the Calgary-trained logistic regression model transfers to a comparable city using the same features and the same standardization parameters. The continuous probability map shows that predicted flood exposure is concentrated in specific corridors and clusters rather than spread evenly across the study area. Higher probabilities appear most clearly in the western and northwestern portions of Salt Lake City and along connected drainage paths, while much of the eastern and southeastern area remains at lower predicted risk.
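The transfer step can be sketched as follows: raw feature values from the new city are standardized with the training city's means and standard deviations, and the fitted coefficients are then applied unchanged. Every number in this example is a hypothetical placeholder (two features, made-up scaling statistics, made-up coefficients, made-up cells); it is not the fitted model, and the function name is introduced here for illustration.

```python
import math

def score_cell(raw_values, scaler, coefficients, intercept):
    """Standardize raw values with training-city statistics, then apply the logistic model."""
    linear = intercept
    for name, beta in coefficients.items():
        mean, sd = scaler[name]
        linear += beta * (raw_values[name] - mean) / sd
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical training-city statistics: feature -> (mean, standard deviation).
training_scaler = {"dist_nearest_stream": (800.0, 600.0), "mean_elev": (1080.0, 40.0)}

# Hypothetical coefficients and intercept (placeholders, not the fitted estimates).
coefficients = {"dist_nearest_stream": -1.5, "mean_elev": -1.0}
intercept = -2.0

# Hypothetical cells from the new city, in raw (unstandardized) units.
new_city_cells = [
    {"dist_nearest_stream": 50.0, "mean_elev": 1030.0},
    {"dist_nearest_stream": 1400.0, "mean_elev": 1120.0},
]

probabilities = [
    score_cell(cell, training_scaler, coefficients, intercept) for cell in new_city_cells
]
flags = [p >= 0.35 for p in probabilities]  # same 0.35 threshold as in the Calgary evaluation
print([round(p, 3) for p in probabilities], flags)
```

Reusing the training-city statistics keeps the coefficients on the scale they were estimated on; re-standardizing with the new city's own statistics would change what a one-unit difference means for each predictor.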
The binary prediction map applies the selected 0.35 threshold and shows the same pattern in simpler form. Most cells are classified as dry, while predicted inundation is concentrated in smaller pockets and corridors, especially in the western portion of the study area. From a planning perspective, these results are best interpreted as a screening tool rather than a definitive flood map.
Overall, the logistic regression model performed well as a flood-screening tool for Calgary and produced spatially plausible predictions for Salt Lake City. In Calgary, the model achieved strong overall performance, with high AUC and specificity, and the predicted flood surface aligned well with major river corridors and drainage paths. Still, the model missed some observed inundated cells, especially near flood edges, so it should be understood as a useful approximation rather than an exact representation of flood behavior.
The Salt Lake City results suggest that the model is also useful as a comparable-city planning tool. The transferred predictions identified a clear pattern of higher flood exposure in specific corridors and lower-lying areas, rather than producing random hotspots. This makes the model valuable as a first-pass estimate of where a Calgary-type flood event might create greater exposure in Salt Lake City, although it should not be treated as a substitute for local hydraulic modeling or official floodplain mapping. Because the model was trained on Calgary data, its generalizability is limited; transfer to other cities is most appropriate where hydrologic, topographic, and climatic conditions are similar to Calgary. For future applications in other cities, incorporating local training data to refine the model would likely improve performance.
Overall, the main value of this approach is that it gives planners a practical way to identify broad patterns of flood vulnerability using a method that is both interpretable and transferable. It is most useful for early screening, spatial targeting, and helping cities decide where more detailed flood analysis and resilience planning are most needed.