Data imputation on a Real Public Dataset, Herold et al. (2020)¶

The article “Integration of time-series meta-omics data reveals how microbial ecosystems respond to disturbance” explores the dynamic nature of microbial communities in biological wastewater treatment plants (BWWTPs) by combining metagenomics (MG), metatranscriptomics (MT), and other meta-omics approaches. The study provides a detailed analysis of these communities’ resilience and resistance to environmental disturbances. We are going to use the data from this study to perform data imputation and analysis. Specifically, we will focus on the following figures:

Figure 3 - Community Structure and Function Dynamics:¶

Figure 3 presents the community structure and function dynamics over time, focusing on the relative abundance and expression levels of microbial populations. The data are derived from recovered populations represented by reconstructed metagenome-assembled genomes (rMAGs), which show genetic composition and activity through MG and MT depth coverage, with mapping percentages of 26% ± 3% and 27% ± 3%, respectively. The relative abundance data are categorized by genus-level taxonomic assignments, and less abundant genera are grouped under “Other” if their mean abundance is below 2%.

This figure also includes ordination plots based on Bray–Curtis dissimilarity of rMAGs’ relative abundances, constrained by abiotic factors such as metabolite levels, metabolite ratios, and physicochemical parameters (shown as black arrows). The results highlight seasonal shifts and environmental influences on microbial community dynamics, with time-course sampling illustrated through connecting points and color coding by the month of sampling.

Figure 4 - Metabolite Levels and Environmental Dynamics:¶

Figure 4 presents time-series data on metabolite levels and physicochemical parameters, highlighting fluctuations in metabolite classes, types, and measurement subtypes. It uses a heatmap to display Z-score transformed intensities of metabolites and other parameters, showcasing intracellular and extracellular metabolite levels, ratios, and several environmental parameters collected manually or through continuous monitoring at a wastewater treatment plant. This data reflects how environmental and metabolite dynamics interact, impacting microbial activity and metabolism over time (Herold et al., 2020).

By focusing on these figures, our VAR (Vector Autoregression) inference and Gaussian Process imputer (GP Imputer) can leverage time-series trends and correlations between microbial populations and their environment within BWWTP systems.

[1]:

import matplotlib.pyplot as plt
import numpy as np


import pandas as pd
import seaborn as sns


from mimic.data_imputation.impute_GP import GPImputer
from mimic.model_infer.infer_VAR_bayes import *

WARNING:tensorflow:From c:\ProgramData\anaconda3\envs\MIMIC\Lib\site-packages\tf_keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.

WARNING:tensorflow:From c:\ProgramData\anaconda3\envs\MIMIC\Lib\site-packages\tensorflow_probability\python\internal\backend\numpy\_utils.py:48: The name tf.logging.TaskLevelStatusMessage is deprecated. Please use tf.compat.v1.logging.TaskLevelStatusMessage instead.

WARNING:tensorflow:From c:\ProgramData\anaconda3\envs\MIMIC\Lib\site-packages\tensorflow_probability\python\internal\backend\numpy\_utils.py:48: The name tf.control_flow_v2_enabled is deprecated. Please use tf.compat.v1.control_flow_v2_enabled instead.

WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.

Imputing Microbial Community Data (Figure 3) using Gaussian Process Imputer¶

[ ]:

data = pd.read_csv(r'Source Data/fig3ab.tsv', delimiter='\t')

data.head(15)

Separate the data into MG and MT datasets¶

The data is separated into two datasets, MG and MT. The MG dataset contains the meta-genomics data, while the MT dataset contains the meta-transcriptomics. We will be using the MG dataset in this example.

[ ]:

# separate the data into MG and MT datasets

dataMG = data[data['type'] == 'MG']

dataMG.info()

dataMT = data[data['type'] == 'MT']

dataMT.info()

We will proceed with some data wrangling to prepare the data for the imputation process.

[ ]:

# Data cleaning
# Rename Nan in the genus column to 'Other'

dataMG['genus'] = dataMG['genus'].replace(np.nan, 'Other')

# Drop the 'type' column
dataMG = dataMG.drop(columns=['type'])

# Change relative Date columns to days
dataMG['date'] = pd.to_datetime(dataMG['date'])
dataMG['date'] = (dataMG['date'] - dataMG['date'].min()).dt.days

# Change the 'genus' column to a categorical column
dataMG['genus'] = dataMG['genus'].astype('category')


dataMG.info()

[ ]:

# Plot the data, relative abundance (value) vs. days (date) for each genus


sns.set(style="whitegrid")
plt.figure(figsize=(12, 6))
sns.lineplot(x='date', y='val', hue='genus', data=dataMG)
plt.title('Microbial Growth Data')
plt.xlabel('Days')
plt.ylabel('Relative Abundance')
plt.show()

[ ]:

# Pivot the data so each 'genus' is a column
dataMG_pivot = dataMG.pivot(
    index='date', columns='genus', values='val').fillna(0)

# Create a stacked area chart
plt.figure(figsize=(12, 6))
plt.stackplot(dataMG_pivot.index, dataMG_pivot.T, labels=dataMG_pivot.columns)

plt.title('Microbial Growth Data')
plt.xlabel('Days')
plt.ylabel('Relative Abundance')
plt.legend(loc='upper left')

plt.show()

[ ]:

# Do the same data cleaning for the MT data set

# Rename Nan in the genus column to 'Other'

dataMT['genus'] = dataMT['genus'].replace(np.nan, 'Other')

# Drop the 'type' column
dataMT = dataMT.drop(columns=['type'])

# Change relative Date columns to days
dataMT['date'] = pd.to_datetime(dataMT['date'])
dataMT['date'] = (dataMT['date'] - dataMT['date'].min()).dt.days

# Change the 'genus' column to a categorical column
dataMT['genus'] = dataMT['genus'].astype('category')

[ ]:

# Plot the data, relative abundance (value) vs. days (date) for each genus


sns.set(style="whitegrid")
plt.figure(figsize=(12, 6))
sns.lineplot(x='date', y='val', hue='genus', data=dataMT)
plt.title('Microbial Meta Transcriptomics Data')
plt.xlabel('Days')

[ ]:

# Pivot the data so each 'genus' is a column
dataMT_pivot = dataMT.pivot(
    index='date', columns='genus', values='val').fillna(0)

# Pivot the MG data so each 'genus' is a column
dataMG_pivot = dataMG.pivot(
    index='date', columns='genus', values='val').fillna(0)

[ ]:

# Create a DataFrame with a complete range of days
all_days = pd.DataFrame(
    {'days': range(dataMG['date'].min(), dataMG['date'].max() + 1)})

# Reindex the pivoted DataFrame to include all days, filling missing values with None
complete_dataMG = dataMG_pivot.reindex(all_days['days'])

complete_dataMG.head(15)

[ ]:

complete_dataMG.columns

If we pick one column, say ‘Acinetobacter’, we can see that there are missing values in the dataset. We will use the GP imputator to fill in the missing values.

[ ]:

complete_dataMG['Acinetobacter'].isnull().sum()

[ ]:

# Reset the index
complete_dataMG_reset = complete_dataMG.reset_index()

# Make index as float
complete_dataMG_reset['days'] = complete_dataMG_reset['days'].astype(float)

[ ]:

# Now 'days' is a regular column and you can use it as a feature
new_dataset = GPImputer().impute_missing_values(
    complete_dataMG_reset, ['days'], ['Acidimicrobium',
                                      'Acinetobacter', 'Albidiferax'], 'Acinetobacter'
)

[ ]:

new_dataset['Acinetobacter'].isnull().sum()

We can see now that the data imputator calculated the missing values for the ‘Acinetobacter’ column. However, as we can see fro the graph, the imputed values are not great. This can be caused by multiple factors, so we are going to explore different methods to better impute the missing values.

We are first going to try to do imputation using VGP imputation. This is a variational Gaussian process imputation method that is more robust to missing data and uses multiple imputs to calculate the missing values. However, as we will see, this method creates an imputation that is heavily biased by other imputs, so it doesn’t work in our case to determine the missing values of the ‘Acinetobacter’ column.

[ ]:

# Now 'days' is a regular column and you can use it as a feature
new_dataset = GPImputer().impute_missing_values(
    complete_dataMG_reset, ['days'], ['Acidimicrobium', 'Acinetobacter', 'Albidiferax',
                                      'Candidatus Microthrix', 'Chitinophaga pinensis',
                                      'Dechloromonas', 'Haliscomenobacter', 'Intrasporangium',
                                      'Leptospira', 'Other', 'Xanthomonas', 'mean abundance < 2%'], 'Acinetobacter'
)

If we, isntead, do a single imputation using only a GPR model with a single imput (‘Acinetobacter’), we can see that the imputation is much better. This is because the imputation is not biased by other imputs, and the model can better predict the missing values.

[ ]:

# Now 'days' is a regular column and you can use it as a feature
new_dataset2 = GPImputer().impute_missing_values(
    complete_dataMG_reset, ['days'], ['Acinetobacter'], 'Acinetobacter'
)

[ ]:

# Save the dataset (new_dataset2) to a csv file
new_dataset2.to_csv(r'imputed_data.csv', index=False)

So we will proceed to do a single imput GPR model for all the columns in the dataset. We will then use the VAR inference method to infer the causal relationships between the variables in the time series.

[ ]:

# do the imputation for each column (genus) in the dataset
# Copy the original dataset
new_dataset_complete = complete_dataMG_reset.copy()

for genus in complete_dataMG.columns:
    # Impute the missing values for the current column
    imputed_data = GPImputer().impute_missing_values(
        complete_dataMG_reset, ['days'], [genus], genus
    )

    # Update the column in the new dataset with the imputed data
    new_dataset_complete[genus] = imputed_data[genus]

# Save the dataset (new_dataset_complete) to a csv file
new_dataset_complete.to_csv('imputed_data.csv', index=False)

Imputing Metabolite Data (Figure 4) using Gaussian Process Imputer¶

[2]:

metabolites = pd.read_csv(r'Source Data/fig4.tsv', delimiter='\t')

This dataset contains several columns, each representing different attributes related to microbial activity and metabolite concentrations. Here is a detailed explanation of each column:

date: The date when the sample was collected.
L1: The functional category or pathway associated with the data point.
mp_tot: Total metaproteomic count for that date and pathway.
mt_tot: Total metatranscriptomic count for that date and pathway.
L1sum_MP: Sum of metaproteomic counts for that specific functional category (L1).
L1sum_MT: Sum of metatranscriptomic counts for that specific functional category (L1).
L1rel_MP: Relative abundance of metaproteomic counts within the functional category.
L1rel_MT: Relative abundance of metatranscriptomic counts within the functional category.
MP_std_L1: Standard deviation of metaproteomic counts for that functional category.
MT_std_L1: Standard deviation of metatranscriptomic counts for that functional category.
cor: Correlation value between metaproteomic and metatranscriptomic data for that pathway.
newL1: A reformatted version of the functional category name, often including a correlation coefficient (r value) for easier readability in the figure.

We will be using the L1rel_MP and L1rel_MT columns in this example to use it with the VAR inference method to infer the causal relationships between the variables in the time series with the meta transcriptomics (MT) and meta genomics (MG) data.

[3]:

# Letr's create 2 dataframes, one for the L1rel_MP and the other for the L1rel_MT data, using the date, L1 and L2tel values

dataL1_MP = metabolites[['date', 'L1', 'L1rel_MP']]

dataL1_MT = metabolites[['date', 'L1', 'L1rel_MT']]
dataL1_MP.head(15)

[3]:

	date	L1	L1rel_MP
0	2011-03-21	01_Fermentation	0.096899
1	2011-03-21	02_Homoacetogenesis	0.009890
2	2011-03-21	03_Superpathway of thiosulfate metabolism (Des...	0.000579
3	2011-03-21	04_Utililization of sugar, conversion of pento...	0.000535
4	2011-03-21	05_Fatty acid oxidation	0.025483
5	2011-03-21	06_Amino acid utilization biosynthesis metabolism	0.122784
6	2011-03-21	07_Nucleic acid metabolism	0.119531
7	2011-03-21	08_Hydrocarbon degradation	0.005524
8	2011-03-21	09_Carbohydrate Active enzyme - CAZy	0.003921
9	2011-03-21	10_TCA cycle	0.045398
10	2011-03-21	11_Nitrogen cycle	0.026063
11	2011-03-21	12_Transporters	0.015593
12	2011-03-21	13_Hydrogen metabolism	0.000846
13	2011-03-21	14_Methanogenesis	0.003520
14	2011-03-21	15_Methylotrophy	0.009267

[4]:

# Now let's visualize the data

sns.set(style="whitegrid")
plt.figure(figsize=(12, 6))
sns.lineplot(x='date', y='L1rel_MP', hue='L1', data=dataL1_MP)
plt.title('Metaproteomic Data')
plt.xlabel('Days')
plt.ylabel('Relative Abundance')
plt.show()

c:\ProgramData\anaconda3\envs\MIMIC\Lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
c:\ProgramData\anaconda3\envs\MIMIC\Lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_33_1.png

[5]:

# Same for the metatranscriptomic data

sns.set(style="whitegrid")
plt.figure(figsize=(12, 6))
sns.lineplot(x='date', y='L1rel_MT', hue='L1', data=dataL1_MT)
plt.title('Metatranscriptomic Data')
plt.xlabel('Days')
plt.ylabel('Relative Abundance')
plt.show()

c:\ProgramData\anaconda3\envs\MIMIC\Lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
c:\ProgramData\anaconda3\envs\MIMIC\Lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_34_1.png

[6]:

# And do the same as we did for the MG and MT data sets

# Rename Nan in the L1 column to 'Other'

dataL1_MP['L1'] = dataL1_MP['L1'].replace(np.nan, 'Other')

# Change relative Date columns to days
dataL1_MP['date'] = pd.to_datetime(dataL1_MP['date'])
dataL1_MP['date'] = (dataL1_MP['date'] - dataL1_MP['date'].min()).dt.days

# Change the 'L1' column to a categorical column
dataL1_MP['L1'] = dataL1_MP['L1'].astype('category')

C:\Users\User\AppData\Local\Temp\ipykernel_32172\300627487.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataL1_MP['L1'] = dataL1_MP['L1'].replace(np.nan, 'Other')
C:\Users\User\AppData\Local\Temp\ipykernel_32172\300627487.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataL1_MP['date'] = pd.to_datetime(dataL1_MP['date'])
C:\Users\User\AppData\Local\Temp\ipykernel_32172\300627487.py:9: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataL1_MP['date'] = (dataL1_MP['date'] - dataL1_MP['date'].min()).dt.days
C:\Users\User\AppData\Local\Temp\ipykernel_32172\300627487.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataL1_MP['L1'] = dataL1_MP['L1'].astype('category')

[7]:

# do the same for the L1rel_MT data set

# Rename Nan in the L1 column to 'Other'

dataL1_MT['L1'] = dataL1_MT['L1'].replace(np.nan, 'Other')

# Change relative Date columns to days
dataL1_MT['date'] = pd.to_datetime(dataL1_MT['date'])
dataL1_MT['date'] = (dataL1_MT['date'] - dataL1_MT['date'].min()).dt.days

# Change the 'L1' column to a categorical column
dataL1_MT['L1'] = dataL1_MT['L1'].astype('category')

# Plot the data, relative abundance (value) vs. days (date) for each L1


sns.set(style="whitegrid")
plt.figure(figsize=(12, 6))
sns.lineplot(x='date', y='L1rel_MP', hue='L1', data=dataL1_MP)
plt.title('Metaproteomic Data')
plt.xlabel('Days')
plt.ylabel('Relative Abundance')
plt.show()

C:\Users\User\AppData\Local\Temp\ipykernel_32172\4211173979.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataL1_MT['L1'] = dataL1_MT['L1'].replace(np.nan, 'Other')
C:\Users\User\AppData\Local\Temp\ipykernel_32172\4211173979.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataL1_MT['date'] = pd.to_datetime(dataL1_MT['date'])
C:\Users\User\AppData\Local\Temp\ipykernel_32172\4211173979.py:9: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataL1_MT['date'] = (dataL1_MT['date'] - dataL1_MT['date'].min()).dt.days
C:\Users\User\AppData\Local\Temp\ipykernel_32172\4211173979.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataL1_MT['L1'] = dataL1_MT['L1'].astype('category')
c:\ProgramData\anaconda3\envs\MIMIC\Lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
c:\ProgramData\anaconda3\envs\MIMIC\Lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
c:\ProgramData\anaconda3\envs\MIMIC\Lib\site-packages\seaborn\_oldcore.py:1057: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  grouped_data = data.groupby(

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_36_1.png

[8]:

# And now pivot the data so each 'L1' is a column for each data set

dataL1_MP_pivot = dataL1_MP.pivot(
    index='date', columns='L1', values='L1rel_MP').fillna(0)

dataL1_MT_pivot = dataL1_MT.pivot(
    index='date', columns='L1', values='L1rel_MT').fillna(0)

[9]:

# Step 1: Determine the minimum and maximum index
min_index = dataL1_MP_pivot.index.min()
max_index = dataL1_MP_pivot.index.max()

# Step 2: Create a new index range from min_index to max_index
new_index = range(int(min_index), int(max_index) +
                  1)  # Ensure it's an integer range

# Step 3: Re-index the DataFrame
dataL1_MP_complete = dataL1_MP_pivot.reindex(new_index)

dataL1_MP_complete.head(15)

[9]:

L1	01_Fermentation	02_Homoacetogenesis	03_Superpathway of thiosulfate metabolism (Desulfovibrio sulfodismutans)	04_Utililization of sugar, conversion of pentose to EMP pathway intermediates	05_Fatty acid oxidation	06_Amino acid utilization biosynthesis metabolism	07_Nucleic acid metabolism	08_Hydrocarbon degradation	09_Carbohydrate Active enzyme - CAZy	10_TCA cycle	...	12_Transporters	13_Hydrogen metabolism	14_Methanogenesis	15_Methylotrophy	16_Embden Meyerhof-Parnos (EMP)	17_Gluconeogenesis	18_Sulfur compounds metabolism	19_Saccharide and derivated synthesis	20_Hydrolysis of polymers	21_Cellular response to stress
date
0	0.096899	0.009890	0.000579	0.000535	0.025483	0.122784	0.119531	0.005524	0.003921	0.045398	...	0.015593	0.000846	0.003520	0.009267	0.030473	0.016261	0.005346	0.009267	0.006193	0.037334
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
7	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
8	0.133734	0.011645	0.000340	0.000870	0.033575	0.129235	0.108779	0.004688	0.003063	0.052367	...	0.018224	0.001059	0.004159	0.016334	0.035428	0.018678	0.003592	0.009150	0.006692	0.042688
9	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
11	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
12	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
13	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
14	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

15 rows × 21 columns

[10]:

# Reset the index
complete_dataMP_reset = dataL1_MP_complete.reset_index()

# Make index as float
complete_dataMP_reset['date'] = complete_dataMP_reset['date'].astype(float)

We are going to use the GP imputer to fill in the missing values in the dataset. We will then use the VAR inference method to infer the causal relationships between the variables in the time series. Note how we are only using the matern (M32, M12, and M52) and radial basis function (RBF) kernels in the GP imputer. This is because these kernels are the most suitable for time series data.

[11]:

# do the imputation for each column (metabolite) in the dataset
# Copy the original dataset
new_dataset_complete = complete_dataMP_reset.copy()

for metabolite in dataL1_MP_complete.columns:
    # Impute the missing values for the current column
    imputed_data = GPImputer().impute_missing_values(
        complete_dataMP_reset, ['date'], [metabolite], metabolite, kernel=['M32', 'M12', 'M52', 'RBF']
    )

    # Update the column in the new dataset with the imputed data
    new_dataset_complete[metabolite] = imputed_data[metabolite]

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern52'>
BIC: 305.14796205640255

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_1.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern52'>
BIC: 539.9907420516718

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_3.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern12'>
BIC: 593.6954527249973

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_5.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern52'>
BIC: 606.429158574828

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_7.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern12'>
BIC: 389.2225921347663

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_9.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern52'>
BIC: 321.45473434185317

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_11.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern32'>
BIC: 292.9620322625907

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_13.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern12'>
BIC: 558.9765129518931

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_15.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern32'>
BIC: 581.8255133135826

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_17.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern32'>
BIC: 384.214449988753

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_19.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern32'>
BIC: 448.7262491062683

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_21.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern32'>
BIC: 475.6907363223153

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_23.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern32'>
BIC: 595.2739514099666

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_25.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern12'>
BIC: 523.8038498902097

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_27.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern52'>
BIC: 492.72580947714255

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_29.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern32'>
BIC: 404.635080730377

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_31.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern32'>
BIC: 488.89321087777034

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_33.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern12'>
BIC: 443.5037096251963

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_35.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern32'>
BIC: 466.12061932819296

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_37.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern12'>
BIC: 520.4635688670087

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_39.png

Model: GPR
Best kernel: <class 'gpflow.kernels.stationaries.Matern32'>
BIC: 405.9096561312746

../../../_images/notebooks_MultiModel_Herold_examples_impute_data_41_41.png

[12]:

# Save the dataset (new_dataset_complete) to a csv file
new_dataset_complete.to_csv('MP_metabolite_imputed_data.csv', index=False)

[ ]:

# Let's do the same for the MT dataset

# Step 1: Determine the minimum and maximum index
min_index = dataL1_MT_pivot.index.min()
max_index = dataL1_MT_pivot.index.max()

# Step 2: Create a new index range from min_index to max_index
new_index = range(int(min_index), int(max_index) +
                  1)  # Ensure it's an integer range

# Step 3: Re-index the DataFrame
dataL1_MT_complete = dataL1_MT_pivot.reindex(new_index)

dataL1_MT_complete.head(15)

[ ]:

# Reset the index
complete_dataMT_reset = dataL1_MT_complete.reset_index()

# Make index as float
complete_dataMT_reset['date'] = complete_dataMT_reset['date'].astype(float)

[ ]:

# do the imputation for each column (metabolite) in the dataset
# Copy the original dataset
new_dataset_complete = complete_dataMT_reset.copy()

for metabolite in dataL1_MT_complete.columns:
    # Impute the missing values for the current column
    imputed_data = GPImputer().impute_missing_values(
        complete_dataMT_reset, ['date'], [metabolite], metabolite, kernel=['M32', 'M12', 'M52', 'RBF']
    )

    # Update the column in the new dataset with the imputed data
    new_dataset_complete[metabolite] = imputed_data[metabolite]

# Save the dataset (new_dataset_complete) to a csv file
new_dataset_complete.to_csv('MT_metabolite_imputed_data.csv', index=False)

Data imputation on a Real Public Dataset, Herold et al. (2020)¶

Figure 3 - Community Structure and Function Dynamics:¶

Figure 4 - Metabolite Levels and Environmental Dynamics:¶

Imputing Microbial Community Data (Figure 3) using Gaussian Process Imputer¶

Separate the data into MG and MT datasets¶

Imputing Metabolite Data (Figure 4) using Gaussian Process Imputer¶

MIMIC

Navigation

Related Topics