# Building machine learning models to predict biotransformation half-lives of micropollutants in activated sludge

**Keywords:** *Machine Learning, Bayesian Inference, Combined Learning, Biotransformation, Organic Micropollutants*
## Project Description

This project curated the Eawag-Sludge package in the enviPath database and used the curated data to develop machine learning models that predict biotransformation half-lives of organic micropollutants in activated sludge.

To overcome the limited size of the training data, we combined the kinetics data of the Eawag-Soil package (895 data points) with the 160 half-lives of the Eawag-Sludge package. To support the theoretical basis for this combined-learning approach, we performed a read-across analysis on the 27 compounds common to both data sets. Our results indicate that using the Bayesian mean is more effective at capturing the relationship between sludge and soil half-lives.
## How to use these scripts
### Step 1

`extract_all_compounds.py`
- Input: (none)
- Output: `sludge_Original_raw.tsv`, `sludge_Leo_raw.tsv`, `sludge_Rich_raw.tsv`

`package_id`: the URL of the Eawag-Sludge package or of an expanded package created by another contributor. For example:
- Original sludge package: https://envipath.org/package/4a3cd0f4-4d2b-4f00-b3e6-a29e721f7038
- Rich sludge package: https://envipath.org/package/8d3d7ca2-ae4e-4779-a6a2-d3539237c439
- Leo sludge package: https://envipath.org/package/195bc500-f0c6-4bcb-b2fe-f1602b5f20a2
### Step 2

`calculations.py`
- Input: `sludge_Original_raw.tsv`, `sludge_Leo_raw.tsv`, `sludge_Rich_raw.tsv`
- Output: `sludge_raw_bay3_check.tsv`

This script combines the various sources of the Eawag-Sludge package. Since some packages report only rate constants or only half-lives, the reaction-order formula is used to convert between the two kinds of kinetic data; a first-order reaction is assumed by default. The script also attaches the biomass concentration to each half-life. Finally, the relevant kinetic data are collected and log-transformed, producing:
- `k_combined`
- `k_biomass_corrected`
- `hl_combined`
- `hl_biomass_corrected`
- `log_k_combined`
- `log_k_biomass_corrected`
- `log_hl_combined`
- `log_hl_biomass_corrected`
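The first-order conversion the script relies on can be sketched as follows. This is a minimal illustration, not `calculations.py` itself; the simple division used for the biomass correction and the base-10 logarithm are assumptions.

```python
import math

def k_from_halflife(hl: float) -> float:
    """First-order kinetics: k = ln(2) / t_1/2."""
    return math.log(2) / hl

def halflife_from_k(k: float) -> float:
    """Inverse relation: t_1/2 = ln(2) / k."""
    return math.log(2) / k

def biomass_corrected_k(k: float, biomass: float) -> float:
    # Normalise the rate constant by the biomass concentration.
    # (Plain division shown here; the exact correction in
    # calculations.py is an assumption.)
    return k / biomass

# Example: a half-life of 5 days corresponds to k = ln(2) / 5 per day.
k = k_from_halflife(5.0)
log_k = math.log10(k)  # the "log_" columns are assumed to be base-10 logs
```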
### Step 3

`calculation_new.py`
- Input: (not specified)
- Output: `sludge_raw_bay2.tsv`
### Step 4

`calculate_target_variables.py`
- Input: `sludge_raw_bay2.tsv`
- Output: `sludge_biomass_bay2.tsv`

Comment out the part of the script that calculates the BI (Bayesian Inference) mean.
### Step 5

`Chowag.ipynb`
- Input: `sludge_biomass_bay2.tsv`
- Output: (none; prints the mean and std of `biomass_hl_log_gmean` and the mean and std of `biomass_hl_log_std`)

This step uses the Visual Studio Code console to visualize the data distribution of each type of target variable.
First group by "reduced_smiles", "biomass_hl_log_std", "biomass_hl_log_median", and "biomass_hl_log_gmean", then calculate the mean and std of the target variables, which are used as the prior parameters for Bayesian.py.
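The aggregation described above can be sketched with pandas. Grouping by compound is equivalent here to dropping duplicate per-compound rows; the toy data and column subset are illustrative, not the notebook itself.

```python
import pandas as pd

# Toy frame with per-compound target columns named in this README.
df = pd.DataFrame({
    "reduced_smiles": ["CCO", "CCO", "c1ccccc1"],
    "biomass_hl_log_gmean": [0.5, 0.5, 1.2],
    "biomass_hl_log_std": [0.1, 0.1, 0.3],
})

# One row per compound, then summary statistics across compounds;
# these feed the prior parameters of Bayesian.py.
per_compound = df.drop_duplicates(subset="reduced_smiles")
stats = per_compound[["biomass_hl_log_gmean", "biomass_hl_log_std"]].agg(
    ["mean", "std"]
)
print(stats)
```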
### Step 6

`calculate_target_variables.py`
- Input: `sludge_biomass_bay2.tsv`
- Output: `sludge_bay_PriorMuStd_bay2.tsv`

Uncomment the part that calculates the BI mean, since at this step we want to compute the Bayesian Inference mean of the target variable.
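The Bayesian mean referred to in Steps 4–6 can be illustrated with the textbook conjugate-normal update, where the soil-derived statistics act as the prior and the sludge measurements as the data. The exact formulation in Bayesian.py may differ; this is only a sketch.

```python
def bayesian_mean(prior_mu, prior_std, observations, obs_std):
    """Posterior mean of a normal mean with known variances
    (conjugate-normal update; a sketch, not Bayesian.py itself)."""
    n = len(observations)
    prior_precision = 1.0 / prior_std ** 2  # weight of the soil-based prior
    data_precision = n / obs_std ** 2       # total weight of the sludge data
    data_mean = sum(observations) / n
    return (prior_precision * prior_mu + data_precision * data_mean) / (
        prior_precision + data_precision
    )

# A tight prior pulls the estimate toward prior_mu; a vague prior leaves
# it close to the mean of the observations.
posterior = bayesian_mean(prior_mu=1.0, prior_std=0.5,
                          observations=[1.6, 1.8], obs_std=0.2)
```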
### Step 7

`Chowag.ipynb`
- Input: `sludge_bay_PriorMuStd_bay2.tsv`
- Output: `sludge_bay_PriorMuStd_bay3.tsv`

Group by the following columns: "hl_log_std", "hl_log_median", "hl_log_gmean", "hl_log_bayesian_mean", "biomass_hl_log_std", "biomass_hl_log_median", "biomass_hl_log_gmean", "biomass_hl_log_bayesian_mean".
### Step 8

`combined_learning.py`
- Input: `soil_model_input_all_full.tsv`, `sludge_bay_PriorMuStd_bay3.`, `soil_bayesian_merge_padel_reduced.tsv`
- Output: RMSE and R2 figures

This step uses not only the target variables but also multiple combinations of molecular descriptors and fingerprints:
- PaDEL
- MACCS
- btrules
- PaDEL + MACCS
- PaDEL + btrules
- MACCS + btrules
- PaDEL + MACCS + btrules

These descriptors and fingerprints are preprocessed and merged before model training.
At the beginning of the script, 7 global variables can be configured:
- COMBINED_LEARNING: set to False when modeling on the sludge dataset alone.
- ITERATIONS: number of modelling repetitions; 20 or more.
- DESCRIPTORS: one of the 7 descriptor/fingerprint combinations listed above; this string is only passed through to the titles of the output figures and to the file names.
- DENDROGRAM: set to False to skip plotting the dendrogram during each random forest feature selection run.
- CLUSTER_THRESHOLD: set to 0.001 or 0.01 to remove collinearity among the features.
- TOPS: the number of top-ranked features kept for subsequent modelling.
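For reference, the configuration block could look like the following sketch. Only six of the seven globals are named in this README, and every value shown is an assumption, not the script's actual default.

```python
# Sketch of the global settings at the top of combined_learning.py.
# All values below are assumptions chosen for illustration.
COMBINED_LEARNING = False    # False: model the sludge dataset on its own
ITERATIONS = 20              # number of modelling repetitions (20 or more)
DESCRIPTORS = "PaDEL+MACCS"  # one of the 7 combination names; used only in
                             # figure titles and output file names
DENDROGRAM = False           # skip the dendrogram plot during feature selection
CLUSTER_THRESHOLD = 0.01     # 0.001 or 0.01, prunes collinear features
TOPS = 20                    # number of top-ranked features kept for modelling
```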
## Acknowledgements

I extend my heartfelt thanks to Kunyang Zhang for his unwavering help and support during my research, offering valuable insights and suggestions that significantly improved the quality of my work.

The Eawag-Soil dataset, the Bayesian Inference package, and the calculation of the enviPath biotransformation rules (btrules) were provided by Jasmin Hafner.