# Building machine learning models to predict biotransformation half-lives of micropollutants in activated sludge

**Keywords:** *Machine Learning, Bayesian Inference, Combined Learning, Biotransformation, Organic Micropollutants*
## Project Description

This project curated the Eawag-Sludge package in the enviPath database and used the curated data to develop machine learning models that predict biotransformation half-lives of organic micropollutants in activated sludge.

To overcome the limited size of the training data, we combined the kinetics data of the Eawag-Soil package (895 data points) with the 160 half-lives of the Eawag-Sludge package. To support the theoretical basis for this combined-learning approach, we performed a read-across analysis on the 27 compounds common to both data sets. Our results indicate that using the Bayesian mean is more effective at capturing the relationship between sludge and soil half-lives.
## How to use these scripts
### Step 1

`extract_all_compounds.py`
- Input: (none)
- Output: `sludge_Original_raw.tsv`, `sludge_Leo_raw.tsv`, `sludge_Rich_raw.tsv`

`package_id`: the URL of the Eawag-Sludge package or of an expanded package created by another contributor. For example:
- Original sludge package: https://envipath.org/package/4a3cd0f4-4d2b-4f00-b3e6-a29e721f7038
- Rich sludge package: https://envipath.org/package/8d3d7ca2-ae4e-4779-a6a2-d3539237c439
- Leo sludge package: https://envipath.org/package/195bc500-f0c6-4bcb-b2fe-f1602b5f20a2
### Step 2

`calculations.py`
- Input: `sludge_Original_raw.tsv`, `sludge_Leo_raw.tsv`, `sludge_Rich_raw.tsv`
- Output: `sludge_raw_bay3_check.tsv`

This script combines the various sources of the Eawag-Sludge package. Since some packages report only rate constants or only half-lives, the reaction-order formula is used to convert between the two kinds of kinetic data; a first-order reaction is assumed by default. The script also attaches the biomass concentration to each half-life. Finally, the relevant kinetic data are collected and log-transformed, producing:
- `k_combined`
- `k_biomass_corrected`
- `hl_combined`
- `hl_biomass_corrected`
- `log_k_combined`
- `log_k_biomass_corrected`
- `log_hl_combined`
- `log_hl_biomass_corrected`
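The first-order conversion the script relies on can be sketched as follows. This is a minimal illustration, not `calculations.py` itself; the simple division used for the biomass correction and the base-10 logarithm are assumptions.

```python
import math

def k_from_halflife(hl: float) -> float:
    """First-order kinetics: k = ln(2) / t_1/2."""
    return math.log(2) / hl

def halflife_from_k(k: float) -> float:
    """Inverse relation: t_1/2 = ln(2) / k."""
    return math.log(2) / k

def biomass_corrected_k(k: float, biomass: float) -> float:
    # Normalise the rate constant by the biomass concentration.
    # (Plain division shown here; the exact correction in
    # calculations.py is an assumption.)
    return k / biomass

# Example: a half-life of 5 days corresponds to k = ln(2) / 5 per day.
k = k_from_halflife(5.0)
log_k = math.log10(k)  # the "log_" columns are assumed to be base-10 logs
```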
### Step 3

`calculation_new.py`
- Input: (not specified)
- Output: `sludge_raw_bay2.tsv`
### Step 4

`calculate_target_variables.py`
- Input: `sludge_raw_bay2.tsv`
- Output: `sludge_biomass_bay2.tsv`

Comment out the part of the script that calculates the BI (Bayesian Inference) mean.
### Step 5

`Chowag.ipynb`
- Input: `sludge_biomass_bay2.tsv`
- Output: (none; prints the mean and std of `biomass_hl_log_gmean` and the mean and std of `biomass_hl_log_std`)

This step uses the Visual Studio Code console to visualize the data distribution of each type of target variable.
First group by "reduced_smiles", "biomass_hl_log_std", "biomass_hl_log_median", and "biomass_hl_log_gmean", then calculate the mean and std of the target variables, which are used as the prior parameters for Bayesian.py.
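The aggregation described above can be sketched with pandas. Grouping by compound is equivalent here to dropping duplicate per-compound rows; the toy data and column subset are illustrative, not the notebook itself.

```python
import pandas as pd

# Toy frame with per-compound target columns named in this README.
df = pd.DataFrame({
    "reduced_smiles": ["CCO", "CCO", "c1ccccc1"],
    "biomass_hl_log_gmean": [0.5, 0.5, 1.2],
    "biomass_hl_log_std": [0.1, 0.1, 0.3],
})

# One row per compound, then summary statistics across compounds;
# these feed the prior parameters of Bayesian.py.
per_compound = df.drop_duplicates(subset="reduced_smiles")
stats = per_compound[["biomass_hl_log_gmean", "biomass_hl_log_std"]].agg(
    ["mean", "std"]
)
print(stats)
```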
### Step 6

`calculate_target_variables.py`
- Input: `sludge_biomass_bay2.tsv`
- Output: `sludge_bay_PriorMuStd_bay2.tsv`

Uncomment the part that calculates the BI mean, since at this step we want to compute the Bayesian Inference mean of the target variable.
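The Bayesian mean referred to in Steps 4–6 can be illustrated with the textbook conjugate-normal update, where the soil-derived statistics act as the prior and the sludge measurements as the data. The exact formulation in Bayesian.py may differ; this is only a sketch.

```python
def bayesian_mean(prior_mu, prior_std, observations, obs_std):
    """Posterior mean of a normal mean with known variances
    (conjugate-normal update; a sketch, not Bayesian.py itself)."""
    n = len(observations)
    prior_precision = 1.0 / prior_std ** 2  # weight of the soil-based prior
    data_precision = n / obs_std ** 2       # total weight of the sludge data
    data_mean = sum(observations) / n
    return (prior_precision * prior_mu + data_precision * data_mean) / (
        prior_precision + data_precision
    )

# A tight prior pulls the estimate toward prior_mu; a vague prior leaves
# it close to the mean of the observations.
posterior = bayesian_mean(prior_mu=1.0, prior_std=0.5,
                          observations=[1.6, 1.8], obs_std=0.2)
```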
### Step 7

`Chowag.ipynb`
- Input: `sludge_bay_PriorMuStd_bay2.tsv`
- Output: `sludge_bay_PriorMuStd_bay3.tsv`

Group by the following columns: "hl_log_std", "hl_log_median", "hl_log_gmean", "hl_log_bayesian_mean", "biomass_hl_log_std", "biomass_hl_log_median", "biomass_hl_log_gmean", "biomass_hl_log_bayesian_mean".
### Step 8

`combined_learning.py`
- Input: `soil_model_input_all_full.tsv`, `sludge_bay_PriorMuStd_bay3.`, `soil_bayesian_merge_padel_reduced.tsv`
- Output: RMSE and R2 figures

This step uses not only the target variables but also multiple combinations of molecular descriptors and fingerprints:
- PaDEL
- MACCS
- btrules
- PaDEL + MACCS
- PaDEL + btrules
- MACCS + btrules
- PaDEL + MACCS + btrules

These descriptors and fingerprints are preprocessed and merged before model training.
At the beginning of the script, 7 global variables can be configured:
- COMBINED_LEARNING: set to False when modeling on the sludge dataset alone.
- ITERATIONS: number of modelling repetitions; 20 or more.
- DESCRIPTORS: one of the 7 descriptor/fingerprint combinations listed above; this string is only passed through to the titles of the output figures and to the file names.
- DENDROGRAM: set to False to skip plotting the dendrogram during each random forest feature selection run.
- CLUSTER_THRESHOLD: set to 0.001 or 0.01 to remove collinearity among the features.
- TOPS: the number of top-ranked features kept for subsequent modelling.
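For reference, the configuration block could look like the following sketch. Only six of the seven globals are named in this README, and every value shown is an assumption, not the script's actual default.

```python
# Sketch of the global settings at the top of combined_learning.py.
# All values below are assumptions chosen for illustration.
COMBINED_LEARNING = False    # False: model the sludge dataset on its own
ITERATIONS = 20              # number of modelling repetitions (20 or more)
DESCRIPTORS = "PaDEL+MACCS"  # one of the 7 combination names; used only in
                             # figure titles and output file names
DENDROGRAM = False           # skip the dendrogram plot during feature selection
CLUSTER_THRESHOLD = 0.01     # 0.001 or 0.01, prunes collinear features
TOPS = 20                    # number of top-ranked features kept for modelling
```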
## Acknowledgements

I extend my heartfelt thanks to Kunyang Zhang for his unwavering help and support during my research, offering valuable insights and suggestions that significantly improved the quality of my work.

The Eawag-Soil dataset, the Bayesian Inference package, and the calculation of the enviPath biotransformation rules (btrules) were provided by Jasmin Hafner.