1 Introduction

Flooding is a major planning challenge because it can damage infrastructure, disrupt transportation, threaten public safety, and create long-term recovery costs. In this project, a logistic regression model was used to estimate the probability that individual 60-meter grid cells would be inundated during a major flood event. The model was trained on observed flood data from Calgary’s 2013 flood and then applied to Salt Lake City as a comparable city. This approach is useful for planning because it uses environmental and built-form conditions to estimate relative flood exposure across space, even where future flood observations are not available.

The model relates a binary inundation outcome, flooded or not flooded, to predictors describing topography, hydrology, watershed processes, land cover, and the built environment. Logistic regression is appropriate because the outcome is binary and the results remain interpretable, making the method especially useful in a planning context where both prediction and explanation matter.

The results show that the model performed strongly in Calgary and produced plausible transferred predictions for Salt Lake City. The model achieved an AUC of 0.954, and the selected 0.35 threshold balanced improved recall with still-high overall accuracy and specificity. In both cities, the spatial pattern of predicted risk aligned with major drainage corridors and lower-lying areas. Overall, the analysis shows that this modeling approach is useful as a flood-screening tool for identifying broad areas of relative vulnerability and helping guide resilience planning.

1.1 Target Flood Inundation Map

The observed binary inundation target map shows the flood pattern that the model was trained to predict in Calgary. Cells coded as inundated are concentrated mainly along the Bow River corridor and connected flood-prone areas, while most of the surrounding study area is classified as not inundated. This spatial pattern confirms that flood exposure is highly uneven across the city and strongly tied to major river and drainage pathways.

2 Feature Selection

To build the model, features were selected to represent several dimensions of flood vulnerability: direct hydrologic exposure, watershed behavior, land cover, and the built environment. The 15 selected features are listed in the table below. Four of them are mapped and explained in more detail in the following subsections: distance to nearest stream, agriculture ratio, maximum flow accumulation, and building cover ratio. Together, these variables reflect different but complementary aspects of flood processes.

Selected 15 features
Category	Feature	Description	Data_Source
Topography	mean_elev	Average elevation within the fishnet cell.	DEM
	min_elev	Lowest elevation within the fishnet cell.	DEM
	elev_range	Difference between maximum and minimum elevation.	DEM
	mean_slope	Average slope within the fishnet cell.	DEM
	max_slope	Maximum slope within the fishnet cell.	DEM
	sd_elev	Standard deviation of elevation within the fishnet cell.	DEM
Hydrology	dist_nearest_stream	Distance from the cell centroid to the nearest stream.	OSM waterways
	water_cover_area	Area of water cover inside the cell.	Surface water raster
	river_density	Stream length per unit area within the cell.	OSM waterways
Watershed	max_flow_accum	Maximum flow accumulation within the cell.	DEM
Land cover	impervious_ratio	Share of the cell covered by impervious surfaces.	Impervious surface raster
	vegetation_ratio	Share of the cell covered by vegetation.	NLCD
	open_soil_ratio	Share of the cell covered by open soil or bare ground.	NLCD
	agriculture_ratio	Share of the cell covered by agricultural land.	NLCD
Built environment	building_cover_ratio	Share of the cell covered by building footprints.	OSM Building footprint

2.1 Distance to Nearest Stream

This is a useful feature because cells closer to mapped stream channels are more likely to be inundated, while cells farther away are less likely to flood. It is expected to be a negative predictor: the probability of inundation should fall as distance increases, since risk is highest near channels and valley bottoms where water tends to concentrate. On the map, darker cells are closer to streams and brighter cells are farther away, clearly showing the corridors of greatest likely exposure.

2.2 Agriculture Ratio

This captures land-cover differences that may influence runoff and flood exposure. Agricultural land can reflect less built-up, more permeable surfaces, but it can also occur in low-lying areas that remain vulnerable to flooding, making it less direct than stream-based features. On the map, darker cells have little agricultural land, while brighter cells show higher agricultural coverage concentrated mainly along the outer edges of the study area.

2.3 Maximum Flow Accumulation

This shows how much upslope runoff drains toward each cell. Inundation depends both on local conditions and on how water collects and moves across the landscape. Cells with higher flow accumulation are more likely to lie along drainage paths or low areas where water concentrates. On the map, brighter lines show the main flow paths, while darker cells indicate much lower accumulated flow.

2.4 Building Cover Ratio

This was included to represent the built environment and level of development within each grid cell. It is useful because areas with more building coverage often have more impervious surfaces, which can reduce infiltration and increase runoff. On the map, brighter cells show higher building coverage and darker cells show less built-up land, with development concentrated in clusters rather than evenly distributed across the study area.

2.5 Multicollinearity Test

The correlation matrix shows that some topography variables, especially elevation and slope-related variables, are strongly positively correlated with one another. This is expected because these features describe related aspects of topography. To make the model more robust and reduce the effects of multicollinearity, three highly correlated topographic variables were removed before fitting the final logistic regression. These excluded features were minimum elevation, mean slope, and standard deviation of elevation. The remaining twelve variables were retained for logistic modeling.
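The screening step described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the 0.8 cutoff, the function names, and the toy values are all assumptions introduced here, since the report does not state which cutoff was used.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def high_correlation_pairs(features, cutoff=0.8):
    """Return (name_a, name_b, r) for every feature pair with |r| >= cutoff.

    The cutoff of 0.8 is an assumed value for illustration only.
    """
    names = sorted(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= cutoff:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Toy values for five hypothetical cells (not the project's data).
toy = {
    "mean_elev": [1040.0, 1055.0, 1062.0, 1080.0, 1101.0],
    "min_elev": [1038.0, 1051.0, 1060.0, 1075.0, 1096.0],
    "dist_nearest_stream": [300.0, 40.0, 520.0, 120.0, 260.0],
}
flagged = high_correlation_pairs(toy)
print(flagged)
```

In this toy case only the two elevation variables are flagged, which mirrors the kind of redundancy that led to dropping one variable from each strongly correlated topographic group.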

3 Model Development and Performance

To estimate flood inundation probability, we fit a logistic regression model using Calgary as the training city. The dependent variable was binary, with each 60-meter fishnet cell coded as either inundated or not inundated based on the observed flood extent. Logistic regression was appropriate for this task because the goal was to predict the probability of a binary outcome rather than a continuous value. It also provides a transparent way to estimate the relationships between the features and the target while producing cell-level probability predictions.
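For reference, the standard form of a logistic regression can be written as follows. The symbols are introduced here for illustration and are not notation from the project: $y_i$ is the inundation label of cell $i$, $x_{ij}$ is its value for predictor $j$, $k$ is the number of predictors, and $\beta_0, \dots, \beta_k$ are the fitted coefficients.

```latex
\[
p_i = \Pr(y_i = 1 \mid \mathbf{x}_i)
    = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}\right)\right)},
\qquad
\log\frac{p_i}{1 - p_i} = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}
\]
```

The second expression shows why the coefficients are interpretable: each $\beta_j$ is the change in the log-odds of inundation for a one-unit increase in predictor $j$, holding the others fixed.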

The modeling workflow involved cleaning the feature data, retaining the final predictors, standardizing the features, and splitting the Calgary data into training and test sets. The split was important because it allowed model performance to be evaluated on unseen data rather than only on the same cells used to fit the model. In a planning context, this matters because the model is intended not just to explain the Calgary flood pattern, but also to support transfer of the same logic to Salt Lake City. Testing the model on held-out Calgary cells therefore provides a more realistic measure of how well the model generalizes.
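The standardization and split steps might look like the sketch below. It is only an illustration under stated assumptions: the 30% test share, the random seed, the function names, and the choice to compute means and standard deviations from the training rows only are all assumptions, because the report does not specify them.

```python
import math
import random

def train_test_split(rows, test_share=0.3, seed=42):
    """Shuffle row indices reproducibly and hold out a share for testing.

    test_share and seed are assumed values, not taken from the report.
    """
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(rows) * test_share)
    test = [rows[i] for i in idx[:n_test]]
    train = [rows[i] for i in idx[n_test:]]
    return train, test

def fit_scaler(rows):
    """Column means and sample standard deviations of the given rows."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return means, sds

def apply_scaler(rows, means, sds):
    """Standardize rows with previously fitted means and standard deviations."""
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

# Ten hypothetical cells with two features each (toy values only).
toy_rows = [[float(i), float((i * 7) % 10)] for i in range(10)]
train, test = train_test_split(toy_rows)
means, sds = fit_scaler(train)             # parameters come from training rows
scaled_train = apply_scaler(train, means, sds)
scaled_test = apply_scaler(test, means, sds)  # reuse the same parameters
```

Reusing the training parameters for the held-out rows keeps the test set from influencing the scaling, which is one common way to keep the evaluation honest.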

Final logistic regression model summary
term	estimate	std.error	statistic	p.value
(Intercept)	-7.0417579	0.0665171	-105.8639302	0.0000000
mean_elev	-2.4239484	0.0386969	-62.6393278	0.0000000
elev_range	-0.2779017	0.0367864	-7.5544601	0.0000000
max_slope	0.0147673	0.0336994	0.4382063	0.6612368
dist_nearest_stream	-3.3665298	0.0595080	-56.5726998	0.0000000
water_cover_area	0.4549904	0.0113383	40.1285569	0.0000000
river_density	-0.0943250	0.0087636	-10.7632916	0.0000000
max_flow_accum	0.1235709	0.0126938	9.7347172	0.0000000
impervious_ratio	-0.0029470	0.0189975	-0.1551238	0.8767237
vegetation_ratio	0.0781022	0.0175655	4.4463417	0.0000087
open_soil_ratio	-0.1546469	0.0244250	-6.3314878	0.0000000
agriculture_ratio	-0.8483197	0.0356294	-23.8095154	0.0000000
building_cover_ratio	-0.1174399	0.0177548	-6.6145363	0.0000000

The final model results show that most predictors contributed meaningfully to the estimated probability of inundation. Ten of the twelve predictors were statistically significant at conventional levels; the two exceptions were max_slope (p = 0.66) and impervious_ratio (p = 0.88). Distance to nearest stream had a strong negative relationship with flooding, indicating that cells farther from mapped channels were less likely to be inundated. Maximum flow accumulation had a positive relationship with inundation, which matches the expectation that cells receiving more upslope drainage are more exposed to concentrated runoff and drainage flow. Agriculture ratio and building cover ratio were both negative in the final model, suggesting that their relationship to inundation is more complex and tied to the broader spatial structure of the study area. Overall, the model coefficients align reasonably well with the physical geography of riverine flooding, especially for variables connected to stream proximity, elevation, and watershed drainage.
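One way to read these estimates is as odds ratios. The sketch below exponentiates a few estimates copied from the summary table. It assumes the coefficients were estimated on the standardized predictors described in the workflow, so that one unit means one standard deviation; if that assumption does not hold, the ratios apply per raw unit instead. The dictionary names are illustrative.

```python
import math

# Estimates copied from the model summary table (subset for illustration).
coefficients = {
    "mean_elev": -2.4239484,
    "dist_nearest_stream": -3.3665298,
    "water_cover_area": 0.4549904,
    "max_flow_accum": 0.1235709,
    "agriculture_ratio": -0.8483197,
    "building_cover_ratio": -0.1174399,
}

# exp(beta) is the multiplicative change in the odds of inundation
# for a one-unit increase in the predictor, holding the others fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} odds ratio per unit increase: {ratio:.3f}")
```

Under that assumption, a one-standard-deviation increase in distance to the nearest stream multiplies the odds of inundation by roughly 0.03, while the same increase in maximum flow accumulation raises the odds by roughly 13%.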

3.1 Model Performance

To evaluate model quality, the ROC curve provides an overall summary of how well the model separates inundated from non-inundated cells across all possible classification thresholds. The model produced an AUC of 0.954, which indicates excellent discrimination. In practical terms, this means the model usually assigns higher predicted probabilities to cells that were actually inundated than to those that were not.
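The ranking interpretation of AUC can be made concrete with a small sketch. It computes AUC as the share of (inundated, non-inundated) pairs in which the inundated cell receives the higher predicted probability, counting ties as half. The function name and the toy labels and scores are hypothetical and unrelated to the project's data.

```python
def auc_by_pairs(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: three flooded cells and four dry cells.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.90, 0.60, 0.20, 0.70, 0.30, 0.10, 0.05]
toy_auc = auc_by_pairs(toy_labels, toy_scores)
print(toy_auc)  # 9 of the 12 pairs are ranked correctly, so 0.75
```

The pairwise loop is quadratic in the number of cells, so for a full fishnet a rank-based implementation from a statistics library would be the practical choice; the loop is used here only because it mirrors the definition directly.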

Because logistic regression produces probabilities rather than fixed classes, a classification threshold had to be chosen to convert predicted probabilities into inundated versus not inundated outcomes. We therefore compared several thresholds on the Calgary test split. This comparison showed the expected tradeoff. Lowering the threshold increased recall, meaning more inundated cells were identified (fewer false negatives), but it also reduced precision and specificity. A threshold of 0.35 was selected as a reasonable balance because it improved recall compared with higher cutoffs while still maintaining strong overall accuracy and high specificity. This choice aligns with the planning purpose of the model, which is to identify urban areas that could potentially be inundated.

The four confusion matrix outcomes are defined as follows. True positive (TP): the model predicted that a cell would flood, and it actually did flood. True negative (TN): the model predicted that a cell would not flood, and it actually did not flood. False positive (FP): the model predicted flooding, but the cell was not actually inundated. False negative (FN): the model predicted no flooding, but the cell was actually inundated.
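As a sketch of how these four outcomes arise from predicted probabilities, the example below labels each cell at a given threshold and tallies the results. Treating a probability equal to the threshold as a flood prediction (`>=`) is an assumption made here, and the function names and toy values are hypothetical.

```python
from collections import Counter

def outcome(observed, probability, threshold=0.35):
    """Label one cell as TP, TN, FP, or FN at the given threshold."""
    predicted = 1 if probability >= threshold else 0  # >= is an assumption
    if predicted == 1:
        return "TP" if observed == 1 else "FP"
    return "FN" if observed == 1 else "TN"

def confusion_counts(observed, probabilities, threshold=0.35):
    """Tally the four outcomes across all cells."""
    counts = Counter({"TP": 0, "TN": 0, "FP": 0, "FN": 0})
    for y, p in zip(observed, probabilities):
        counts[outcome(y, p, threshold)] += 1
    return dict(counts)

# Six hypothetical cells (toy values, not the project's data).
toy_observed = [1, 1, 0, 0, 1, 0]
toy_probs = [0.80, 0.36, 0.40, 0.10, 0.20, 0.34]

at_035 = confusion_counts(toy_observed, toy_probs, threshold=0.35)
at_050 = confusion_counts(toy_observed, toy_probs, threshold=0.50)
print(at_035)
print(at_050)
```

In the toy data, moving from 0.50 to 0.35 recovers one additional flooded cell at the cost of one additional false alarm, which is the same tradeoff seen in the threshold comparison.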

Using the selected threshold of 0.35, the model achieved an accuracy of 95.6%, a precision of 0.623, a recall of 0.508, an F1 score of 0.560, and a specificity of 0.982 on the Calgary test split. These results suggest that the model performs strongly overall, especially in identifying non-inundated cells. However, the high accuracy should be interpreted with caution because it is inflated by class imbalance: only 3,220 of the 58,737 test cells (about 5.5%) were inundated, so labeling every cell as dry would already yield roughly 94.5% accuracy. At the same time, recall is noticeably lower than specificity, which means the model still misses about half of the flooded cells even though its overall accuracy is high. This gap also partly reflects the imbalance between the two classes.

Threshold comparison on the Calgary test split
Threshold	TP	TN	FP	FN	Accuracy	Precision	Recall	F1	Specificity
0.50	1291	55111	406	1929	0.9602465	0.7607543	0.4009317	0.5251169	0.9926869
0.40	1511	54794	723	1709	0.9585951	0.6763653	0.4692547	0.5540887	0.9869770
0.35	1637	54526	991	1583	0.9561775	0.6229072	0.5083851	0.5598495	0.9821496
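The metrics in the table follow directly from the four counts in each row. The sketch below recomputes them using the standard definitions; the counts are copied from the table, while the function and variable names are illustrative.

```python
def metrics(tp, tn, fp, fn):
    """Standard classification metrics from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }

# (TP, TN, FP, FN) for each threshold, copied from the comparison table.
rows = {
    0.50: (1291, 55111, 406, 1929),
    0.40: (1511, 54794, 723, 1709),
    0.35: (1637, 54526, 991, 1583),
}
table = {threshold: metrics(*counts) for threshold, counts in rows.items()}

for threshold, m in table.items():
    print(threshold, {name: round(value, 4) for name, value in m.items()})
```

Recomputing the rows this way also makes the imbalance visible: every row sums to 58,737 test cells, of which 3,220 (TP + FN) were observed as inundated.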

3.2 Interpreting the Classification Outcomes

The test-set classification outcomes map shows where the model performs well and where errors remain. True positives are concentrated along the main river corridor and major drainage routes, showing that the model captures the broad flood pattern well. True negatives dominate the larger background area away from channels, which is consistent with the high specificity of the model. False negatives appear mainly near flood edges, suggesting some underprediction in transition zones, while false positives occur in a smaller number of cells, often near hydrologic pathways. Overall, the map suggests that the model reproduces the main flood structure reasonably well, with some local over- and under-prediction remaining.

The full predicted probability map for Calgary is shown below. Darker cells indicate lower predicted flood probability, while brighter cells indicate higher predicted flood probability. For planning purposes, this continuous surface is useful because it shows relative flood risk across the city.

3.3 Overall Model Assessment and Limitation

Overall, the final logistic regression model performed well as a flood-screening tool for Calgary. Its strongest performance lies in distinguishing broad low-risk and high-risk areas, especially along major stream and drainage corridors. The very high AUC and strong specificity suggest that the model captures the general structure of flood exposure effectively.

However, the moderate recall also shows that some observed inundated cells are still missed, especially near the edges of flooded areas. This means the model should be understood as a useful planning approximation rather than a perfect representation of flood behavior at every location.

4 Salt Lake City Prediction Analysis

The Salt Lake City prediction maps show how the Calgary-trained logistic regression model transfers to a comparable city using the same features and the same standardization parameters. The continuous probability map shows that predicted flood exposure is concentrated in specific corridors and clusters rather than spread evenly across the study area. Higher probabilities appear most clearly in the western and northwestern portions of Salt Lake City and along connected drainage paths, while much of the eastern and southeastern area remains at lower predicted risk.

The binary prediction map applies the selected 0.35 threshold and shows the same pattern in simpler form. Most cells are classified as dry, while predicted inundation is concentrated in smaller pockets and corridors, especially in the western portion of the study area. From a planning perspective, these results are best interpreted as a screening tool rather than a definitive flood map.
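The transfer step can be sketched as follows: each Salt Lake City cell is standardized with the Calgary parameters, passed through the fitted logistic function, and then compared with the 0.35 threshold. Everything numeric in this example except the threshold is hypothetical, including the two-feature model, its intercept and coefficients, and the Calgary means and standard deviations; the names are introduced only for illustration.

```python
import math

THRESHOLD = 0.35  # threshold selected on the Calgary test split

def predict_probability(raw, means, sds, intercept, betas):
    """Standardize raw features with training-city parameters, then apply
    the logistic function to the linear predictor."""
    z = intercept
    for name, beta in betas.items():
        z += beta * (raw[name] - means[name]) / sds[name]
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical two-feature model and Calgary scaling parameters.
toy_intercept = -2.0
toy_betas = {"dist_nearest_stream": -1.5, "mean_elev": -1.0}
calgary_means = {"dist_nearest_stream": 400.0, "mean_elev": 1080.0}
calgary_sds = {"dist_nearest_stream": 200.0, "mean_elev": 40.0}

# Two hypothetical Salt Lake City cells (raw, unstandardized values).
near_channel = {"dist_nearest_stream": 0.0, "mean_elev": 1000.0}
far_upland = {"dist_nearest_stream": 800.0, "mean_elev": 1160.0}

p_near = predict_probability(near_channel, calgary_means, calgary_sds,
                             toy_intercept, toy_betas)
p_far = predict_probability(far_upland, calgary_means, calgary_sds,
                            toy_intercept, toy_betas)

for label, p in (("near channel", p_near), ("far upland", p_far)):
    status = "inundated" if p >= THRESHOLD else "dry"
    print(f"{label}: p = {p:.3f} -> {status}")
```

Because the Calgary means and standard deviations are reused, a Salt Lake City cell is scored relative to Calgary's feature distribution, which is one reason transferred predictions are better read as relative screening values than as calibrated local probabilities.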

5 Conclusion

Overall, the logistic regression model performed well as a flood-screening tool for Calgary and produced spatially plausible predictions for Salt Lake City. In Calgary, the model achieved strong overall performance, with high AUC and specificity, and the predicted flood surface aligned well with major river corridors and drainage paths. Still, the model missed some observed inundated cells, especially near flood edges, so it should be understood as a useful approximation rather than an exact representation of flood behavior.

The Salt Lake City results suggest that the model is also useful as a comparable-city planning tool. The transferred predictions identified a clear pattern of higher flood exposure in specific corridors and lower-lying areas, rather than producing random hotspots. This makes the model valuable as a first-pass estimate of where a Calgary-type flood event might create greater exposure in Salt Lake City, although it should not be treated as a substitute for local hydraulic modeling or official floodplain mapping. Because the model was trained on Calgary data, its generalizability is limited; transfer to other cities is most appropriate where hydrologic, topographic, and climatic conditions are similar to Calgary. For future applications in other cities, incorporating local training data to refine the model would likely improve performance.

Overall, the main value of this approach is that it gives planners a practical way to identify broad patterns of flood vulnerability using a method that is both interpretable and transferable. It is most useful for early screening, spatial targeting, and helping cities decide where more detailed flood analysis and resilience planning are most needed.

Flood Inundation Probability Modeling for Calgary and Salt Lake City

Coco Zhou, Mark Deng