IIIT Hyderabad Publications
A Hybrid Machine Learning Framework for River Water Quality Prediction under Data Uncertainties

Author: RAJESH MADDU (2019900041)
Date: 2024-04-18
Report no: IIIT/TH/2024/39
Advisor: Shaik Rehana

Abstract

The impact of climate change on water quality variables is an essential topic for sustainable river water quality management in a warming environment and is a major environmental concern worldwide. River Water Quality (RWQ) models aim to simulate the behavior of various water quality variables in response to pollutants, land use changes, and climate change. However, these models suffer from sparse data, which leads to data uncertainty. In the past decades, different models have been used successfully for RWQ modeling at different spatial and temporal scales. Physically based water quality models can simulate RWQ variables, but they require large amounts of detailed site-specific data, including stream geometry, meteorological variables, and hydraulic properties of the river, which are unavailable for many river systems globally. Unlike process-based models, statistical models do not require a large number of input variables, which is an advantage for the many ungauged river systems where such inputs are unavailable. However, statistical models struggle to describe the nonlinear characteristics of a data series accurately, which is a significant shortcoming of this approach. To overcome such limitations, artificial intelligence algorithms, i.e., Machine Learning (ML) techniques, are widely used to address a range of nonlinear prediction problems. Such models are suited to extracting information from sequential data in RWQ modeling, and they allow models to be built from a reduced number of variables with more accurate simulation.
Machine Learning (ML) has been increasingly adopted due to its ability to model complex, nonlinear relationships between RWQ variables and their predictors (e.g., Air Temperature (AT) and streamflow). Simulating RWQ parameters with data-driven algorithms requires more input variables, which are unavailable for many ungauged river systems. The readily available climatic variables are the maximum, minimum, and average AT, which can be used to build RWQ models with more accurate simulation and higher computational efficiency. In this context, most ML approaches have been applied without any detailed sensitivity analysis to identify the most influential variables to be considered in the prediction of RWQ variables. Furthermore, the development of systematic models combined with ML under minimal input data has not been intensively studied for predicting RWQ variables. To address these gaps, the present study first demonstrates how ML approaches such as Ridge regression (RR), the K-nearest neighbors (KNN) regressor, the Random Forest (RF) regressor, and Support Vector Regression (SVR) can be coupled with Sobol' global sensitivity analysis (GSA) to produce accurate estimates of RWQ variables. Under anthropogenic climate change, changes in AT can affect River Water Temperature (RWT), the primary variable that influences water quality. Therefore, the present study selected RWT as the water quality variable to predict, with a tropical river system of India, the Tunga-Bhadra River, as a case study. Further, the proposed ML approaches have been combined with the Ensemble Kalman Filter (EnKF) data assimilation (DA) technique to improve the predicted values based on the measured data. Overall, the study concluded that SVR was the most robust ML model when coupled with a global sensitivity algorithm and DA techniques, and that it predicted RWT better at a monthly time scale than at daily and seasonal time scales.
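The data assimilation step described above can be illustrated with a minimal sketch of a single stochastic EnKF analysis update for a scalar state that is observed directly. This is an assumption-laden illustration, not the thesis's implementation: the function name `enkf_update`, the ensemble size, the forecast spread, and the observation error variance are all hypothetical, and the abstract does not specify how the EnKF was configured or coupled to the ML models.

```python
import random
from statistics import mean, variance


def enkf_update(forecast_ensemble, observation, obs_error_var, rng):
    """One stochastic EnKF analysis step for a directly observed scalar state."""
    # Sample variance of the forecast ensemble stands in for the forecast error variance.
    forecast_var = variance(forecast_ensemble)
    # Scalar Kalman gain: weight given to the observation relative to the forecast.
    gain = forecast_var / (forecast_var + obs_error_var)
    obs_sd = obs_error_var ** 0.5
    analysis = []
    for member in forecast_ensemble:
        # Each member is nudged toward its own perturbed copy of the observation.
        perturbed_obs = observation + rng.gauss(0.0, obs_sd)
        analysis.append(member + gain * (perturbed_obs - member))
    return analysis


rng = random.Random(42)
# Hypothetical ensemble of model-predicted monthly RWT values (deg C).
forecast = [rng.gauss(26.0, 1.5) for _ in range(200)]
# Hypothetical measured RWT (deg C) with an assumed error variance.
analysis = enkf_update(forecast, observation=24.0, obs_error_var=0.25, rng=rng)

print(round(mean(forecast), 2), round(mean(analysis), 2))
```

In this sketch the analysis ensemble is pulled toward the measurement and its spread shrinks, which is the sense in which assimilation "improves the predicted values based on the measured data".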
The study also concluded that the SVR model is a strong choice for smaller datasets and is less sensitive to outliers in the data than some other models. SVR is also generally less computationally expensive than the other ML models considered.

Another source of data uncertainty in RWQ modeling is the lack of long time series that capture interannual variability, together with the lack of consistent water quality measurement datasets. Generally, RWQ data are available at a monthly scale, cover limited durations, and are burdened with a large number of missing values. In this context, the selection of appropriate model inputs, the development of models under limited data, the processing of non-stationary data, seasonality scenarios, and the potentially influential lags of variables have not been intensively investigated in the literature, especially for the estimation of RWQ variables. Given the missing, limited, and non-stationary data scenarios, the present thesis developed hybrid models for RWQ variable prediction using Long Short-Term Memory (LSTM), integrated with (i) k-nearest neighbor (k-NN) bootstrap resampling algorithms (kNN-LSTM) to address the data limitations and (ii) the discrete wavelet transform (WT) approach (WT-LSTM) to address the time-frequency localized features. To demonstrate the prediction of RWQ variables and to assess the impact of climate change on river water quality parameters, this study considered the two most important water quality variables, i.e., River Water Temperature (RWT) and saturated Dissolved Oxygen (DO) concentrations, with AT and lag variables as predictors. When WT and k-NN bootstrap resampling algorithms were included, LSTM outperformed the conventional models; hence these hybrid models are promising new frameworks for RWQ prediction in data-sparse regions. Bayesian optimization is applied to optimize the hyperparameters of all applied ML models.
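The idea of k-NN bootstrap resampling for a short record can be sketched as follows. This is a simplified univariate, lag-1 variant written for illustration only: the function name `knn_bootstrap`, the choice of `k`, the 1/j neighbour weights, and the sample values are assumptions, and the abstract does not state the feature vector or settings used in the kNN-LSTM model.

```python
import random


def knn_bootstrap(series, length, k, rng):
    """Generate a synthetic series by resampling successors of nearest neighbours."""
    # Only values that have a successor can serve as neighbours.
    history = series[:-1]
    # Closer neighbours get larger resampling weights (1/j for the j-th nearest).
    weights = [1.0 / j for j in range(1, k + 1)]
    current = rng.choice(series)
    synthetic = [current]
    while len(synthetic) < length:
        # Indices of the k historical values closest to the current value.
        nearest = sorted(range(len(history)),
                         key=lambda i: abs(history[i] - current))[:k]
        chosen = rng.choices(nearest, weights=weights, k=1)[0]
        # The successor of the chosen neighbour becomes the next synthetic value.
        current = series[chosen + 1]
        synthetic.append(current)
    return synthetic


# Hypothetical one-year record of monthly RWT (deg C).
observed = [22.1, 23.4, 26.0, 28.3, 29.5, 28.1, 26.7, 26.2, 26.5, 25.8, 24.0, 22.6]
synthetic = knn_bootstrap(observed, length=36, k=3, rng=random.Random(0))
print(len(synthetic))
```

Because every synthetic value is drawn from the observed record, the resampled series preserves the observed value range while providing additional training sequences, which is one way such a scheme could ease data limitations before fitting an LSTM.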
The hybrid kNN-LSTM effectively predicted RWT for five catchment sites (i.e., Narmada, Cauvery, Musi, Godavari, and Ganga) out of seven (i.e., Narmada, Cauvery, Sabarmati, Tunga-Bhadra, Musi, Godavari, and Ganga) at monthly time scales under data limitations, and outperformed the standalone LSTM, the WT-LSTM, and the hybrid 3-parameter version of the Air2Stream model (a physically based RWT prediction model). This thesis also presents the combined effects of streamflow and AT on the prediction of RWT using the kNN-LSTM model, the LSTM model, a modified nonlinear regression model, and an 8-parameter version of Air2Stream, applied to three major river systems of India (Tunga-Bhadra, Musi, and Ganga). Results revealed that the kNN-LSTM model predicted RWT more accurately than the LSTM model, the modified nonlinear regression model, and the 8-parameter version of the Air2Stream model for all three catchment sites. Overall, the study concluded that hybrid models consistently outperformed standalone models in addressing uncertainty due to data sparsity.

The study assessed the climate change impacts on river water quality variables using an ensemble of National Aeronautics and Space Administration (NASA) Earth Exchange Global Daily Downscaled Projections (NEX-GDDP) with Representative Concentration Pathway (RCP) scenarios 4.5 and 8.5 for seven major polluted river catchments of India. For this assessment, the best-performing hybrid kNN-LSTM model was used for future predictions. The RWT increases for the Tunga-Bhadra, Musi, Ganga, and Narmada basins are predicted to be 3.0, 4.0, 4.6, and 4.7 °C, respectively, for 2071-2100. Overall, RWT over Indian catchments is likely to rise by more than 3.0 °C for 2071-2100. While river water temperatures (RWTs) are increasing under climate change signals, how climate change affects DO saturation levels in response to RWT has not been intensively studied.
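The temperature dependence of DO saturation can be illustrated with a short worked sketch. It assumes a widely used empirical relation for freshwater at one atmosphere and zero salinity (the Benson-Krause form adopted in standard water-analysis methods); whether the thesis uses this relation is not stated in the abstract, and the function names below are hypothetical.

```python
import math


def do_saturation_mg_per_l(temp_c):
    """DO saturation (mg/L) in freshwater at 1 atm, assumed Benson-Krause form."""
    t = temp_c + 273.15  # convert deg C to kelvin
    ln_do = (-139.34411
             + 1.575701e5 / t
             - 6.642308e7 / t ** 2
             + 1.243800e10 / t ** 3
             - 8.621949e11 / t ** 4)
    return math.exp(ln_do)


def percent_drop_per_degree(temp_c):
    """Percentage loss of DO saturation when water warms by 1 deg C."""
    base = do_saturation_mg_per_l(temp_c)
    warmer = do_saturation_mg_per_l(temp_c + 1.0)
    return 100.0 * (base - warmer) / base


for temp_c in (20.0, 25.0, 30.0, 35.0):
    print(temp_c,
          round(do_saturation_mg_per_l(temp_c), 2),
          round(percent_drop_per_degree(temp_c), 2))
```

The sketch only shows the direction and rough size of the effect of warming on saturation capacity under the stated assumptions; the figures reported in this abstract come from the thesis's own models and scenarios.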
This thesis examined the direct effect of rising RWTs on saturated DO concentrations for seven major polluted river catchments of India at a monthly scale. The RWT reaches close to 35 °C and decreases DO saturation capacity by 2%–12% for 2071–2100. This thesis also evaluated the effect of climate change on DO saturation levels with respect to RWT and streamflow using the kNN-LSTM model forced with nine hypothetical climate change scenarios for three polluted catchments of India (Tunga-Bhadra, Musi, and Ganga). The largest DO decrease (13.22%) relative to the historical values was found in the Ganga catchment for the selected climate change scenarios. Overall, for every 1 °C increase in RWT, there will be about a 2.3% decrease in DO saturation levels over Indian catchments under climate signals. The study demonstrates how hybrid ML methods can be coupled with a global sensitivity algorithm, DA techniques, bootstrapping algorithms, and wavelets to generate accurate predictions of RWQ variables under data uncertainties. Although the focus of our study has been limited to climate change impacts on RWT and DO saturation, the proposed hybrid ML modeling frameworks are generic and have the potential to incorporate other water quality parameters as well, to support better decisions in river water quality management.

Full thesis: pdf

Centre for Spatial Informatics
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.