Tutorial 3 - Mouse blastocyst
import linmo.resample as resample
import linmo.plot as plot
First, we will import the mouse blastocyst dataset from Morris et al., 2010. This dataset consists of 337 trees, acquired by confocal time-lapse microscopy of 20 mouse embryos developing from the 8-cell stage to the blastocyst stage (~55 h). During this period, a series of asymmetric cleavages gives rise to inside and outside cells: inside cells give rise to the inner cell mass, and outside cells give rise to trophectoderm (abbreviated T). The internalized cells then undergo a second cell fate decision to give rise to either epiblast or primitive endoderm (abbreviated E and P, respectively).
The data should first be formatted as a list of trees, each represented in NEWICK format without branch lengths or interior node labels, with trees separated by semicolons.
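As a minimal sketch of this input format, the snippet below splits a semicolon-separated NEWICK string into individual trees and checks that each has balanced parentheses. The example trees and the helper names `split_newick_trees` and `is_balanced` are hypothetical and not part of linmo; the exact file layout linmo expects (for instance, whether trees sit on separate lines) is an assumption to check against the dataset files.

```python
def split_newick_trees(text):
    """Split semicolon-separated NEWICK text into a list of tree strings."""
    chunks = [chunk.strip() for chunk in text.split(';')]
    return [chunk for chunk in chunks if chunk]  # drop the empty tail after the last ';'


def is_balanced(tree):
    """Return True if every '(' in the tree string has a matching ')'."""
    depth = 0
    for char in tree:
        if char == '(':
            depth += 1
        elif char == ')':
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


# Hypothetical trees using the fate labels from this tutorial
example_text = "((T,T),(T,P));(E,(P,P));((A,T),T);"
trees = split_newick_trees(example_text)
```

A quick check like this before resampling can catch truncated or malformed trees early.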
Here, we have separate datasets for progenitors that started on the outside or the inside of the embryo at the 8-cell stage. We will load them together using the multi_dataset_resample_trees function. This function resamples each dataset num_resamples times with replacement, automatically detects all cell fates across all provided datasets (or takes a list of cell fates to use), and counts the number of occurrences of each doublet, triplet, or quartet subtree.
It outputs the subtree dictionary, a list of the detected cell fates, and a DataFrame that lists the number of occurrences of each subtree in each resample and in the original dataset. The DataFrame also contains the expected number of occurrences of each subtree, based on the probabilities of observing each of its constituent cell fates, and a label indicating which dataset each row belongs to.
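To make the observed and expected counts more concrete, here is a rough sketch that counts doublets in NEWICK strings and derives an expected count from cell fate frequencies. It is not linmo's implementation: the helper names are hypothetical, fate labels are assumed to be single characters, and the fate probabilities are estimated from the cells that appear in doublets, which is an assumption. The formula used (total doublets times the product of fate probabilities, doubled for mixed doublets) does appear consistent with the output table in this tutorial: for the Outside progenitors, 2·sqrt(2.511 × 11.624) ≈ 10.81 matches the expected (A,P) value.

```python
import re
from collections import Counter

# Matches a pair of sister leaves such as "(T,P)"; assumes single-character fates
DOUBLET_PATTERN = re.compile(r'\((\w),(\w)\)')


def count_doublets(trees):
    """Count sister-cell pairs, ignoring left/right order."""
    counts = Counter()
    for tree in trees:
        for left, right in DOUBLET_PATTERN.findall(tree):
            a, b = sorted((left, right))
            counts[f'({a},{b})'] += 1
    return counts


def expected_doublets(counts):
    """Expected counts if the two fates in a doublet were independent (assumed model)."""
    total = sum(counts.values())
    fate_counts = Counter()
    for label, n in counts.items():
        fate_counts[label[1]] += n  # first fate in "(a,b)"
        fate_counts[label[3]] += n  # second fate in "(a,b)"
    n_cells = sum(fate_counts.values())
    probs = {fate: n / n_cells for fate, n in fate_counts.items()}
    fates = sorted(probs)
    expected = {}
    for i, a in enumerate(fates):
        for b in fates[i:]:
            multiplicity = 1 if a == b else 2  # (a,b) and (b,a) are the same doublet
            expected[f'({a},{b})'] = total * multiplicity * probs[a] * probs[b]
    return expected


# Hypothetical trees, not taken from the blastocyst dataset
trees = ['((T,T),(T,P))', '(E,(P,P))', '((A,T),T)']
observed = count_doublets(trees)
expected = expected_doublets(observed)
```

Because the expected counts are built from probabilities that sum to one, they add up to the total number of observed doublets.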
datasets = ['datasets/mouse_blastocyst_outside.txt',
'datasets/mouse_blastocyst_inside.txt']
dataset_names = ['Outside progenitors', 'Inside progenitors']
Doublet motif analysis
(subtree_dict,
cell_fates,
dfs_dataset_c) = resample.multi_dataset_resample_trees(datasets,
dataset_names,
'doublet',
num_resamples=10000,
replacement_bool=True,
cell_fates='auto',
)
subtree_dict
{'(A,A)': 0, '(A,E)': 1, '(A,P)': 2, '(A,T)': 3, '(E,E)': 4, '(E,P)': 5, '(E,T)': 6, '(P,P)': 7, '(P,T)': 8, '(T,T)': 9}
cell_fates
['A', 'E', 'P', 'T']
dfs_dataset_c.head()
(index) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 9993 | 9994 | 9995 | 9996 | 9997 | 9998 | 9999 | observed | expected | dataset |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3.0 | 4.0 | 3.0 | 4 | 6 | 4 | 2 | 3.0 | 3 | 6.0 | ... | 3.0 | 1.0 | 2.0 | 2.0 | 0.0 | 2.0 | 2.0 | 13.0 | 2.511160 | Outside progenitors |
1 | 5.0 | 0.0 | 1.0 | 1 | 2 | 2 | 3 | 1.0 | 3 | 2.0 | ... | 2.0 | 2.0 | 0.0 | 2.0 | 2.0 | 1.0 | 2.0 | 6.0 | 2.282873 | Outside progenitors |
2 | 20.0 | 10.0 | 7.0 | 13 | 8 | 8 | 10 | 12.0 | 8 | 15.0 | ... | 6.0 | 11.0 | 13.0 | 8.0 | 16.0 | 12.0 | 8.0 | 19.0 | 10.805597 | Outside progenitors |
3 | 29.0 | 36.0 | 39.0 | 35 | 31 | 26 | 29 | 38.0 | 26 | 27.0 | ... | 36.0 | 34.0 | 35.0 | 24.0 | 32.0 | 40.0 | 31.0 | 4.0 | 33.482131 | Outside progenitors |
4 | 0.0 | 0.0 | 0.0 | 1 | 1 | 1 | 1 | 0.0 | 1 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 6.0 | 0.518835 | Outside progenitors |
5 rows × 10003 columns
We will now reformat the DataFrame for plotting. The next function, multi_dataset_dfs_for_plotting, generates several DataFrames:
- df_true_melt_dataset_label_c_c contains various characteristics of each subtree (e.g., the observed and expected number, z-score, and adjusted p-value).
- df_melt_subset_c_c contains the number of occurrences of each subtree in all of the resamples.
- df_melt_100resamples_subset_c_c contains the number of occurrences of each subtree in only 100 resamples.
- df_null_zscores_i_c_melt_subset_c_c contains the null z-score of each subtree in num_null resamples. The null z-scores are calculated by comparing each resample set to the rest of the resample sets.
- df_null_zscores_i_c_melt_100resamples_subset_c_c contains the null z-score of each subtree in 100 resamples.
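The z-score and null z-score ideas can be sketched as follows. This is an illustration under stated assumptions, not linmo's code: the function names are hypothetical, the resample counts are made up, and the choice of the population standard deviation is an assumption. The null z-score is written as a leave-one-out comparison, following the description that each resample set is compared to the rest.

```python
import statistics


def z_score(value, reference):
    """Standardize a count against a list of reference counts."""
    mean = statistics.fmean(reference)
    std = statistics.pstdev(reference)  # population std; an assumption
    return (value - mean) / std if std > 0 else 0.0


def null_z_scores(resample_counts, num_null):
    """Leave-one-out z-scores for the first num_null resamples."""
    scores = []
    for i in range(num_null):
        rest = resample_counts[:i] + resample_counts[i + 1:]
        scores.append(z_score(resample_counts[i], rest))
    return scores


# Hypothetical counts of one subtree across eight resamples
resample_counts = [10, 10, 11, 10, 13, 12, 9, 11]
observed_count = 46

observed_z = z_score(observed_count, resample_counts)
null_z = null_z_scores(resample_counts, num_null=3)
```

An observed z-score far outside the range of the null z-scores suggests the subtree occurs more (or less) often than resampling alone would produce.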
We can also specify a cutoff for how many subtrees to include in the final plot: None keeps all subtrees, 'auto' keeps all subtrees with an absolute z-score above 1, and an integer keeps that many of the most significant subtrees (ranked from highest to lowest absolute z-score).
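The three cutoff modes can be illustrated with a small stand-alone function. The name `apply_cutoff`, the motif labels, and the z-scores below are hypothetical; the sketch only mirrors the behavior described above and is not how linmo implements it internally.

```python
def apply_cutoff(z_scores, cutoff='auto'):
    """Select subtrees by absolute z-score.

    z_scores maps a subtree label to its z-score. The result is a list of
    (label, z-score) pairs sorted from highest to lowest absolute z-score.
    """
    ranked = sorted(z_scores.items(), key=lambda item: abs(item[1]), reverse=True)
    if cutoff is None:
        return ranked  # keep everything
    if cutoff == 'auto':
        return [(label, z) for label, z in ranked if abs(z) > 1]
    return ranked[:int(cutoff)]  # keep the top-N subtrees


# Hypothetical z-scores
z_scores = {'motif_a': 3.2, 'motif_b': -4.5, 'motif_c': 0.4, 'motif_d': -1.7}

kept_all = apply_cutoff(z_scores, cutoff=None)
kept_auto = apply_cutoff(z_scores, cutoff='auto')
kept_top2 = apply_cutoff(z_scores, cutoff=2)
```

Note that under- and over-represented subtrees are ranked together, since the ordering uses the absolute z-score.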
(df_true_melt_dataset_label_c_c,
df_melt_subset_c_c,
df_melt_100resamples_subset_c_c,
df_null_zscores_i_c_melt_subset_c_c,
df_null_zscores_i_c_melt_100resamples_subset_c_c) = plot.multi_dataset_dfs_for_plotting(dfs_dataset_c,
dataset_names,
10000,
subtree_dict,
cutoff='auto',
num_null=100)
df_true_melt_dataset_label_c_c.head()
(index) | subtree_val | observed | expected | z-score | abs z-score | label | null min | null mean | null max | p_val | adj_p_val_fdr_bh | adj_p_val_fdr_tsbh | dataset | null z-score min | null z-score mean | null z-score max |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 7 | 46.0 | 11.624203 | 10.384755 | 10.384755 | (P,P) | 1.0 | 11.6004 | 25.0 | 0.000100 | 0.000143 | 0.000014 | Outside progenitors | -2.295173 | -0.051419 | 3.443591 |
4 | 4 | 6.0 | 0.518835 | 7.530455 | 7.530455 | (E,E) | 0.0 | 0.5295 | 5.0 | 0.000100 | 0.000143 | 0.000014 | Outside progenitors | -0.728943 | -0.095627 | 3.402923 |
0 | 0 | 13.0 | 2.511160 | 6.630009 | 6.630009 | (A,A) | 0.0 | 2.5186 | 10.0 | 0.000100 | 0.000143 | 0.000014 | Outside progenitors | -1.593422 | 0.095798 | 2.202804 |
9 | 9 | 155.0 | 111.607104 | 5.400546 | 5.400546 | (T,T) | 84.0 | 111.6305 | 143.0 | 0.000100 | 0.000143 | 0.000014 | Outside progenitors | -3.193400 | 0.008634 | 2.288158 |
2 | 2 | 19.0 | 10.805597 | 2.596904 | 2.596904 | (A,P) | 1.0 | 10.7545 | 24.0 | 0.012098 | 0.015122 | 0.001512 | Outside progenitors | -2.443116 | 0.033239 | 2.913233 |
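The adj_p_val_fdr_bh column name suggests a Benjamini-Hochberg false discovery rate correction. Assuming that is the case, the adjustment can be sketched in plain Python as below. The function name and the p-values are hypothetical, and this sketch does not cover the two-stage variant implied by fdr_tsbh.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four subtrees
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
```

Each adjusted value is at least as large as its raw p-value, so a subtree that passes a threshold after adjustment also passes it before.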
df_melt_subset_c_c.head()
(index) | subtree_val | observed | label | dataset |
---|---|---|---|---|
70000 | 7 | 10.0 | (P,P) | Outside progenitors |
70001 | 7 | 10.0 | (P,P) | Outside progenitors |
70002 | 7 | 11.0 | (P,P) | Outside progenitors |
70003 | 7 | 10.0 | (P,P) | Outside progenitors |
70004 | 7 | 13.0 | (P,P) | Outside progenitors |
df_melt_100resamples_subset_c_c.head()
(index) | subtree_val | observed | label | dataset |
---|---|---|---|---|
700 | 7 | 10.0 | (P,P) | Outside progenitors |
701 | 7 | 10.0 | (P,P) | Outside progenitors |
702 | 7 | 11.0 | (P,P) | Outside progenitors |
703 | 7 | 10.0 | (P,P) | Outside progenitors |
704 | 7 | 13.0 | (P,P) | Outside progenitors |
df_null_zscores_i_c_melt_subset_c_c.head()
(index) | subtree_val | observed | label | dataset |
---|---|---|---|---|
700 | 7 | -0.483168 | (P,P) | Outside progenitors |
701 | 7 | -0.483168 | (P,P) | Outside progenitors |
702 | 7 | -0.181262 | (P,P) | Outside progenitors |
703 | 7 | -0.483168 | (P,P) | Outside progenitors |
704 | 7 | 0.422544 | (P,P) | Outside progenitors |
df_null_zscores_i_c_melt_100resamples_subset_c_c.head()
(index) | subtree_val | observed | label | dataset |
---|---|---|---|---|
700 | 7 | -0.483168 | (P,P) | Outside progenitors |
701 | 7 | -0.483168 | (P,P) | Outside progenitors |
702 | 7 | -0.181262 | (P,P) | Outside progenitors |
703 | 7 | -0.483168 | (P,P) | Outside progenitors |
704 | 7 | 0.422544 | (P,P) | Outside progenitors |
We will specify a dictionary of cell fates with assigned colors for plotting purposes.
cell_fates
['A', 'E', 'P', 'T']
cell_color_dict = plot.make_color_dict(cell_fates, ['#7EC352',
'#5FC0D4',
'#9C80B8',
'#F89A3A',
])
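Conceptually, the color dictionary maps each cell fate to one color. A minimal sketch, assuming make_color_dict simply pairs labels with colors in the order given (this is an assumption about its behavior, and `make_color_dict_sketch` is a hypothetical stand-in, not the linmo function):

```python
def make_color_dict_sketch(labels, colors):
    """Pair each label with the color at the same position."""
    if len(labels) != len(colors):
        raise ValueError('labels and colors must have the same length')
    return dict(zip(labels, colors))


cell_fates = ['A', 'E', 'P', 'T']
colors = ['#7EC352', '#5FC0D4', '#9C80B8', '#F89A3A']
cell_color_dict_sketch = make_color_dict_sketch(cell_fates, colors)
```

Under that assumption, the order of the color list matters: it should follow the order of cell_fates shown above.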
We can plot the frequency and deviation plots for an individual dataset using the DataFrame outputs from multi_dataset_dfs_for_plotting and the cell color dictionary, filtering each DataFrame to the dataset of interest.
plot.plot_frequency('doublet',
df_true_melt_dataset_label_c_c.loc[df_true_melt_dataset_label_c_c['dataset']=='Outside progenitors'],
df_melt_subset_c_c.loc[df_melt_subset_c_c['dataset']=='Outside progenitors'],
df_melt_100resamples_subset_c_c.loc[df_melt_100resamples_subset_c_c['dataset']=='Outside progenitors'],
cell_color_dict,
use_expected=True,
fdr_type='fdr_tsbh',
cutoff='auto',
title='Outside progenitors doublet frequency',
multiple_datasets=True,
legend_bool=True,
legend_pos='outside',
save=False,
image_format='png',
dpi=300,
image_save_path=None)
plot.plot_deviation('doublet',
df_true_melt_dataset_label_c_c.loc[df_true_melt_dataset_label_c_c['dataset']=='Outside progenitors'],
df_null_zscores_i_c_melt_subset_c_c.loc[df_null_zscores_i_c_melt_subset_c_c['dataset']=='Outside progenitors'],
df_null_zscores_i_c_melt_100resamples_subset_c_c.loc[df_null_zscores_i_c_melt_100resamples_subset_c_c['dataset']=='Outside progenitors'],
cell_color_dict,
fdr_type='fdr_tsbh',
cutoff='auto',
title='Outside progenitors deviation from resamples',
multiple_datasets=True,
legend_bool=True,
legend_pos='outside',
save=False,
image_format='png',
dpi=300,
image_save_path=None)
We can also plot motif frequency and deviation across multiple datasets using the DataFrame outputs from multi_dataset_dfs_for_plotting and the cell color dictionary. Let's also specify a dataset color dictionary.
dataset_color_dict = plot.make_color_dict(dataset_names, ['#F89A3A',
'#A23D6B',
])
plot.multi_dataset_plot_deviation('doublet',
dataset_names,
df_true_melt_dataset_label_c_c,
dataset_color_dict,
cell_color_dict,
cutoff='auto',
title='auto',
legend_bool=True,
legend_pos='outside',
save=False,
image_format='png',
dpi=300,
image_save_path=None)
Triplet motif analysis
(subtree_dict,
cell_fates,
dfs_dataset_c) = resample.multi_dataset_resample_trees(datasets,
dataset_names,
'triplet',
num_resamples=10000,
replacement_bool=True,
)
(df_true_melt_dataset_label_c_c,
df_melt_subset_c_c,
df_melt_100resamples_subset_c_c,
df_null_zscores_i_c_melt_subset_c_c,
df_null_zscores_i_c_melt_100resamples_subset_c_c) = plot.multi_dataset_dfs_for_plotting(dfs_dataset_c,
dataset_names,
10000,
subtree_dict,
cutoff=15,
num_null=1)
plot.multi_dataset_plot_deviation('triplet',
dataset_names,
df_true_melt_dataset_label_c_c,
dataset_color_dict,
cell_color_dict,
cutoff=15,
title='auto',
legend_bool=True,
legend_pos='outside',
save=False,
image_format='png',
dpi=300,
image_save_path=None)
Quartet motif analysis
(subtree_dict,
cell_fates,
dfs_dataset_c) = resample.multi_dataset_resample_trees(datasets,
dataset_names,
'quartet',
num_resamples=10000,
replacement_bool=True,
)
(df_true_melt_dataset_label_c_c,
df_melt_subset_c_c,
df_melt_100resamples_subset_c_c,
df_null_zscores_i_c_melt_subset_c_c,
df_null_zscores_i_c_melt_100resamples_subset_c_c) = plot.multi_dataset_dfs_for_plotting(dfs_dataset_c,
dataset_names,
10000,
subtree_dict,
cutoff=15,
num_null=1)
plot.multi_dataset_plot_deviation('quartet',
dataset_names,
df_true_melt_dataset_label_c_c,
dataset_color_dict,
cell_color_dict,
cutoff=15,
title='auto',
legend_bool=True,
legend_pos='outside',
save=False,
image_format='png',
dpi=300,
image_save_path=None)