Tutorial 2 - Zebrafish retina

In [1]:

                
import linmo.resample as resample
import linmo.plot as plot

First, we will import the zebrafish retina dataset from He et al., 2014. This dataset consists of 60 trees acquired by confocal time-lapse microscopy of zebrafish retinal progenitors developing from 32 through 72 hours post fertilization, spanning the nasal-temporal axis. During this period, the vast majority of progenitors exit the cell cycle and terminally differentiate to form the major neuronal and glial cell types (ganglion, amacrine, bipolar, photoreceptor, horizontal, and Müller glia, abbreviated here as G, A, B, R, H, and M, respectively).

The data should first be formatted as a list of trees, each represented in NEWICK format without branch lengths or interior node labels, and separated by semicolons.
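As a rough illustration of this input format, the sketch below splits semicolon-separated NEWICK text into a list of tree strings and runs a basic parenthesis check. The example trees and the helper names `split_newick_trees` and `is_balanced` are hypothetical and are not part of linmo.

```python
# Hypothetical file contents: three small trees in NEWICK format,
# with leaf labels only (no branch lengths, no interior node labels).
raw_text = "((A,B),(R,R));\n(G,(A,A));\n((B,B),((R,R),H));"


def split_newick_trees(text):
    """Split semicolon-separated NEWICK text into a list of tree strings."""
    trees = []
    for chunk in text.split(";"):
        tree = "".join(chunk.split())  # drop spaces and newlines
        if tree:
            trees.append(tree)
    return trees


def is_balanced(tree):
    """Return True if every opening parenthesis has a matching close."""
    depth = 0
    for char in tree:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


trees = split_newick_trees(raw_text)
print(trees)
print(all(is_balanced(tree) for tree in trees))
```

A check like this can catch truncated or malformed entries before they reach the resampling step.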

Here, we have different datasets corresponding to progenitors developing in different parts of the eye (nasal, medial, and temporal), so we will load these together using the multi_dataset_resample_trees function. This function will resample each dataset num_resamples times with replacement, automatically detect all cell fates across all provided datasets (or accept a user-supplied list of cell fates), and count the number of subtree occurrences for each doublet, triplet, or quartet.
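To make the idea of counting doublets across resamples concrete, here is a simplified sketch. It is an illustration only and not linmo's implementation: it assumes that a resample keeps each tree's shape and redraws every leaf label, with replacement, from the pooled labels of the dataset. The trees and the helper names `count_doublets` and `resample_labels` are hypothetical.

```python
import random
import re
from collections import Counter

# A doublet is an innermost pair of sister leaves, e.g. "(A,B)".
DOUBLET = re.compile(r"\((\w+),(\w+)\)")


def count_doublets(trees):
    """Count doublets over a list of NEWICK strings, ignoring sister order."""
    counts = Counter()
    for tree in trees:
        for left, right in DOUBLET.findall(tree):
            a, b = sorted((left, right))
            counts[f"({a},{b})"] += 1
    return counts


def resample_labels(trees, rng):
    """Assumed scheme: keep tree shapes, redraw each leaf from the label pool."""
    pool = [label for tree in trees for label in re.findall(r"\w+", tree)]
    return [re.sub(r"\w+", lambda _: rng.choice(pool), tree) for tree in trees]


trees = ["((A,B),(R,R))", "(G,(A,A))", "((B,B),((R,R),H))"]  # toy data
observed = count_doublets(trees)

rng = random.Random(0)  # seeded for reproducibility
resampled_counts = [
    count_doublets(resample_labels(trees, rng)) for _ in range(1000)
]
print(dict(observed))
```

Because the tree shapes are preserved in this sketch, every resample contains the same total number of doublets as the original data; only the fate composition of each doublet changes.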

It will output the subtree dictionary, a list of the detected cell fates, and a DataFrame that lists the number of occurrences of each subtree in each resample and in the original dataset. The DataFrame also contains the expected number of occurrences of each subtree, based on the probabilities of observing each of its constituent cell fates, and a label indicating which dataset each row belongs to.
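One way to read the expected count of a doublet is as the total number of doublets multiplied by the product of the frequencies of its two fates, with a factor of two for mixed doublets because either sister can carry either fate. The sketch below works through that arithmetic under this independence assumption. The fate counts and the function name `expected_doublet_counts` are hypothetical, and linmo's own calculation may differ in detail.

```python
from itertools import combinations_with_replacement


def expected_doublet_counts(fate_counts, num_doublets):
    """Expected doublet counts if the two sister fates were independent."""
    total = sum(fate_counts.values())
    freq = {fate: n / total for fate, n in fate_counts.items()}
    expected = {}
    for a, b in combinations_with_replacement(sorted(freq), 2):
        # (A,B) and (B,A) are the same doublet, so mixed pairs count twice.
        orderings = 1 if a == b else 2
        expected[f"({a},{b})"] = num_doublets * orderings * freq[a] * freq[b]
    return expected


# Toy numbers, not taken from the zebrafish data.
fate_counts = {"A": 4, "B": 3, "R": 3}
expected = expected_doublet_counts(fate_counts, num_doublets=10)
for label, value in expected.items():
    print(label, round(value, 3))
```

With these toy numbers the expected counts sum to the total number of doublets, which is a quick sanity check on the formula.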

In [2]:

                
datasets = ['datasets/zebrafish_retina_temporal.txt',
            'datasets/zebrafish_retina_middle.txt',
            'datasets/zebrafish_retina_nasal.txt']

dataset_names = ['Temporal region', 'Middle region', 'Nasal region']

Doublet motif analysis

In [3]:

                
(subtree_dict, 
 cell_fates, 
 dfs_dataset_c) = resample.multi_dataset_resample_trees(datasets, 
                                                        dataset_names,
                                                        'doublet',
                                                        num_resamples=10000, 
                                                        replacement_bool=True,
                                                        cell_fates='auto',
                                                        )

100%|██████████| 10000/10000 [00:08<00:00, 1231.01it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
100%|██████████| 10000/10000 [00:06<00:00, 1587.53it/s]
100%|██████████| 6/6 [00:00<00:00, 548.90it/s]
100%|██████████| 1/1 [00:00<00:00, 3075.00it/s]
100%|██████████| 10000/10000 [00:05<00:00, 1906.07it/s]
100%|██████████| 11/11 [00:00<00:00, 545.58it/s]
100%|██████████| 2/2 [00:00<00:00, 4080.06it/s]
100%|██████████| 3/3 [00:23<00:00,  7.95s/it]

In [4]:

                
subtree_dict

Out[4]:

{'(A,A)': 0,
 '(A,B)': 1,
 '(A,G)': 2,
 '(A,H)': 3,
 '(A,M)': 4,
 '(A,R)': 5,
 '(B,B)': 6,
 '(B,G)': 7,
 '(B,H)': 8,
 '(B,M)': 9,
 '(B,R)': 10,
 '(G,G)': 11,
 '(G,H)': 12,
 '(G,M)': 13,
 '(G,R)': 14,
 '(H,H)': 15,
 '(H,M)': 16,
 '(H,R)': 17,
 '(M,M)': 18,
 '(M,R)': 19,
 '(R,R)': 20}
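The ordering of this dictionary appears consistent with enumerating all unordered pairs of the six sorted cell fates. The following sketch reproduces that indexing with the standard library. It is an observation about the output above, not a description of linmo's internals, and the name `doublet_index` is hypothetical.

```python
from itertools import combinations_with_replacement

cell_fates = ['A', 'B', 'G', 'H', 'M', 'R']

# Enumerate unordered fate pairs in sorted order and number them from 0.
doublet_index = {
    f"({a},{b})": i
    for i, (a, b) in enumerate(
        combinations_with_replacement(sorted(cell_fates), 2)
    )
}

print(len(doublet_index))
print(doublet_index['(H,H)'])
```

With six fates there are 6 * 7 / 2 = 21 possible doublets, which matches the 21 entries shown above.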

In [5]:

                
cell_fates

Out[5]:

['A', 'B', 'G', 'H', 'M', 'R']

In [6]:

                
dfs_dataset_c.head()

Out[6]:

	0	1	2	3	4	5	6	7	8	9	...	9993	9994	9995	9996	9997	9998	9999	observed	expected	dataset
0	6.0	3.0	5.0	4.0	7.0	3.0	2.0	9.0	5.0	1.0	...	3.0	3.0	6.0	3.0	6.0	1.0	4.0	9.0	3.722652	Temporal region
1	4.0	4.0	3.0	3.0	1.0	4.0	6.0	9.0	3.0	5.0	...	2.0	8.0	2.0	5.0	8.0	8.0	7.0	2.0	5.118646	Temporal region
2	1.0	0.0	3.0	2.0	2.0	2.0	5.0	1.0	1.0	1.0	...	3.0	2.0	2.0	2.0	1.0	1.0	1.0	0.0	1.628660	Temporal region
3	2.0	3.0	3.0	1.0	2.0	5.0	2.0	2.0	3.0	3.0	...	3.0	0.0	2.0	4.0	1.0	2.0	4.0	2.0	2.791989	Temporal region
4	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	...	1.0	0.0	0.0	0.0	0.0	1.0	1.0	0.0	0.232666	Temporal region

5 rows × 10003 columns

We will now reformat the DataFrame for plotting. The next function, multi_dataset_dfs_for_plotting, generates several DataFrames:

df_true_melt_dataset_label_c_c will contain various characteristics of each subtree (i.e., the observed and expected number, z-score, adjusted p-value, etc.).
df_melt_subset_c_c will contain the number of occurrences of each subtree in all of the resamples.
df_melt_100resamples_subset_c_c will contain the number of occurrences of each subtree in only 100 resamples.
df_null_zscores_i_c_melt_subset_c_c will contain the null z-score for each subtree in num_null resamples. The null z-scores are calculated by comparing each resample set to the rest of the resample sets.
df_null_zscores_i_c_melt_100resamples_subset_c_c will contain the null z-score for each subtree in 100 resamples.
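The z-score and null z-score described above can be sketched for a single subtree as follows. This is a minimal illustration with simulated counts, assuming the z-score is the deviation from the resample mean in units of the resample standard deviation, and that each null z-score compares one resample against all the others. The function names and the toy data are hypothetical, and linmo may use a different estimator.

```python
import random
import statistics


def z_score(value, reference):
    """Deviation of value from the reference mean, in standard deviations."""
    mean = statistics.fmean(reference)
    sd = statistics.pstdev(reference)
    return (value - mean) / sd if sd > 0 else 0.0


def null_z_scores(resample_counts, num_null):
    """Leave-one-out z-scores: each resample compared against the rest."""
    scores = []
    for i in range(num_null):
        rest = resample_counts[:i] + resample_counts[i + 1:]
        scores.append(z_score(resample_counts[i], rest))
    return scores


rng = random.Random(1)  # seeded for reproducibility
# Toy resample counts for one subtree: 500 draws of a small binomial.
resample_counts = [
    sum(rng.random() < 0.1 for _ in range(20)) for _ in range(500)
]
observed = 7  # hypothetical observed count

z = z_score(observed, resample_counts)
nulls = null_z_scores(resample_counts, num_null=100)
print(round(z, 2), round(min(nulls), 2), round(max(nulls), 2))
```

Comparing the observed z-score with the spread of the null z-scores gives a sense of how much deviation arises from resampling noise alone.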

We can also specify a cutoff for how many subtrees to include in the final plot: None keeps all subtrees, 'auto' keeps all subtrees with an absolute z-score above 1, and an integer n keeps the n most significant subtrees (ranked from highest to lowest absolute z-score).
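The three cutoff modes can be sketched as a small selection function. This is an illustration of the behavior described above rather than linmo's code. The function name `select_subtrees` is hypothetical, the first five z-scores are rounded from the Temporal region output shown in this tutorial, and the (A,M) value is made up to show a subtree that 'auto' would drop.

```python
def select_subtrees(z_scores, cutoff="auto"):
    """Return subtree labels ranked by absolute z-score, filtered by cutoff."""
    ranked = sorted(
        z_scores, key=lambda label: abs(z_scores[label]), reverse=True
    )
    if cutoff is None:
        return ranked  # keep everything
    if cutoff == "auto":
        return [label for label in ranked if abs(z_scores[label]) > 1]
    return ranked[:int(cutoff)]  # keep the top-n subtrees


z_scores = {
    "(H,H)": 6.23,
    "(B,B)": 5.53,
    "(A,B)": -1.44,
    "(R,R)": 4.55,
    "(A,A)": 2.84,
    "(A,M)": 0.30,  # hypothetical value, not from the dataset
}

print(select_subtrees(z_scores, cutoff=None))
print(select_subtrees(z_scores, cutoff="auto"))
print(select_subtrees(z_scores, cutoff=2))
```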

In [7]:

                
(df_true_melt_dataset_label_c_c,
 df_melt_subset_c_c, 
 df_melt_100resamples_subset_c_c,
 df_null_zscores_i_c_melt_subset_c_c,
 df_null_zscores_i_c_melt_100resamples_subset_c_c) = plot.multi_dataset_dfs_for_plotting(dfs_dataset_c, 
                                                                                        dataset_names, 
                                                                                        10000, 
                                                                                        subtree_dict,
                                                                                        cutoff='auto',
                                                                                        num_null=100)

100%|██████████| 100/100 [00:02<00:00, 45.62it/s]
100%|██████████| 100/100 [00:02<00:00, 45.19it/s]
100%|██████████| 100/100 [00:02<00:00, 45.74it/s]

In [8]:

                
df_true_melt_dataset_label_c_c.head()

Out[8]:

	subtree_val	observed	expected	z-score	abs z-score	label	null mean	null max	p_val	adj_p_val_fdr_bh	adj_p_val_fdr_tsbh	dataset	null z-score min	null z-score mean	null z-score max
15	15	5.0	0.523498	6.228141	6.228141	(H,H)	0.5222	5.0	0.000200	0.002100	0.001400	Temporal region	-0.726380	0.010907	3.448576
6	6	9.0	1.759535	5.532958	5.532958	(B,B)	1.7754	8.0	0.000100	0.002100	0.001400	Temporal region	-1.359883	-0.111358	2.470436
1	1	2.0	5.118646	-1.444152	1.444152	(A,B)	5.0986	14.0	0.106679	0.248917	0.165945	Temporal region	-2.377075	-0.162488	2.751621
20	20	21.0	8.728601	4.548063	4.548063	(R,R)	8.7274	21.0	0.000300	0.002100	0.001400	Temporal region	-2.123083	0.186302	2.696244
0	0	9.0	3.722652	2.841655	2.841655	(A,A)	3.6972	12.0	0.010498	0.034193	0.022795	Temporal region	-1.981737	-0.014536	3.379631
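The column name adj_p_val_fdr_bh suggests a Benjamini-Hochberg false discovery rate correction. As a rough check, with 21 doublets the smallest p-value of about 0.0001 maps to 0.0001 × 21 / 1 = 0.0021, which agrees with the adjusted value shown for (B,B). The sketch below implements the standard Benjamini-Hochberg adjustment in plain Python on toy p-values. The function name is hypothetical, and the two-stage variant behind adj_p_val_fdr_tsbh is not covered here.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


toy_p_values = [0.01, 0.04, 0.03, 0.005]  # toy values, not from the dataset
adjusted = benjamini_hochberg(toy_p_values)
print([round(p, 4) for p in adjusted])
```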

In [9]:

                
df_melt_subset_c_c.head()

Out[9]:

	subtree_val	observed	label	dataset
150000	15	0.0	(H,H)	Temporal region
150001	15	0.0	(H,H)	Temporal region
150002	15	1.0	(H,H)	Temporal region
150003	15	1.0	(H,H)	Temporal region
150004	15	1.0	(H,H)	Temporal region

In [10]:

                
df_melt_100resamples_subset_c_c.head()

Out[10]:

	subtree_val	observed	label	dataset
1500	15	0.0	(H,H)	Temporal region
1501	15	0.0	(H,H)	Temporal region
1502	15	1.0	(H,H)	Temporal region
1503	15	1.0	(H,H)	Temporal region
1504	15	1.0	(H,H)	Temporal region

In [11]:

                
df_null_zscores_i_c_melt_subset_c_c.head()

Out[11]:

	subtree_val	observed	label	dataset
1500	15	-0.726380	(H,H)	Temporal region
1501	15	-0.726380	(H,H)	Temporal region
1502	15	0.664617	(H,H)	Temporal region
1503	15	0.664617	(H,H)	Temporal region
1504	15	0.664617	(H,H)	Temporal region

In [12]:

                
df_null_zscores_i_c_melt_100resamples_subset_c_c.head()

Out[12]:

	subtree_val	observed	label	dataset
1500	15	-0.726380	(H,H)	Temporal region
1501	15	-0.726380	(H,H)	Temporal region
1502	15	0.664617	(H,H)	Temporal region
1503	15	0.664617	(H,H)	Temporal region
1504	15	0.664617	(H,H)	Temporal region

We will specify a dictionary of cell fates with assigned colors for plotting purposes.
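Conceptually, such a dictionary pairs each cell fate with one hex color. The sketch below builds an equivalent mapping with the standard library. The helper `make_color_mapping` is hypothetical and is shown only to illustrate the expected shape of the result; whether plot.make_color_dict behaves identically is an assumption.

```python
def make_color_mapping(labels, colors):
    """Pair each label with one color, in order."""
    if len(labels) != len(colors):
        raise ValueError("need exactly one color per label")
    return dict(zip(labels, colors))


fates = ['A', 'B', 'G', 'H', 'M', 'R']
colors = ['#F89A3A', '#9C80B8', '#F071AB', '#F0E135', '#5FC0D4', '#7EC352']

color_mapping = make_color_mapping(fates, colors)
print(color_mapping)
```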

In [13]:

                
cell_fates

Out[13]:

['A', 'B', 'G', 'H', 'M', 'R']

In [14]:

                
cell_color_dict = plot.make_color_dict(cell_fates, ['#F89A3A', 
                                                    '#9C80B8', 
                                                    '#F071AB', 
                                                    '#F0E135',
                                                    '#5FC0D4', 
                                                    '#7EC352',
                                                    ])

We can plot the frequency and deviation plots for an individual dataset using the DataFrame outputs from multi_dataset_dfs_for_plotting and the cell color dictionary.

In [15]:

                
plot.plot_frequency('doublet', 
                    df_true_melt_dataset_label_c_c.loc[df_true_melt_dataset_label_c_c['dataset']=='Temporal region'], 
                    df_melt_subset_c_c.loc[df_melt_subset_c_c['dataset']=='Temporal region'], 
                    df_melt_100resamples_subset_c_c.loc[df_melt_100resamples_subset_c_c['dataset']=='Temporal region'], 
                    cell_color_dict,
                    use_expected=True,
                    fdr_type='fdr_tsbh',
                    cutoff='auto', 
                    title='Temporal region doublet frequency',
                    multiple_datasets=True,
                    legend_bool=True, 
                    legend_pos='outside',
                    save=False, 
                    image_format='png',
                    dpi=300,
                    image_save_path=None)

In [16]:

                
plot.plot_deviation('doublet', 
                    df_true_melt_dataset_label_c_c.loc[df_true_melt_dataset_label_c_c['dataset']=='Temporal region'], 
                    df_null_zscores_i_c_melt_subset_c_c.loc[df_null_zscores_i_c_melt_subset_c_c['dataset']=='Temporal region'], 
                    df_null_zscores_i_c_melt_100resamples_subset_c_c.loc[df_null_zscores_i_c_melt_100resamples_subset_c_c['dataset']=='Temporal region'], 
                    cell_color_dict,
                    fdr_type='fdr_tsbh',
                    cutoff='auto', 
                    title='Temporal region deviation from resamples',
                    multiple_datasets=True,
                    legend_bool=True, 
                    legend_pos='outside',
                    save=False, 
                    image_format='png',
                    dpi=300,
                    image_save_path=None)

We can also plot motif frequency and deviation across multiple datasets using the DataFrame outputs from multi_dataset_dfs_for_plotting and the cell color dictionary. Let's also specify a dataset color dictionary.

In [17]:

                
dataset_color_dict = plot.make_color_dict(dataset_names, ['#712D1B', 
                                                          '#EC7960', 
                                                          '#9E9E36', 
                                                          ])

In [18]:

                
plot.multi_dataset_plot_deviation('doublet', 
                                  dataset_names,
                                  df_true_melt_dataset_label_c_c, 
                                  dataset_color_dict,
                                  cell_color_dict,
                                  cutoff='auto', 
                                  title='auto',
                                  legend_bool=True,
                                  legend_pos='outside',
                                  save=False, 
                                  image_format='png',
                                  dpi=300,
                                  image_save_path=None)

Triplet motif analysis

In [19]:

                
(subtree_dict, 
 cell_fates, 
 dfs_dataset_c) = resample.multi_dataset_resample_trees(datasets, 
                                                        dataset_names,
                                                        'triplet',
                                                        num_resamples=10000, 
                                                        replacement_bool=True, 
                                                        )

100%|██████████| 10000/10000 [00:04<00:00, 2184.05it/s]
100%|██████████| 108/108 [00:00<00:00, 567.50it/s]
100%|██████████| 4/4 [00:00<00:00, 5427.76it/s]
100%|██████████| 12/12 [00:00<00:00, 7155.48it/s]
100%|██████████| 10000/10000 [00:03<00:00, 3296.85it/s]
100%|██████████| 102/102 [00:00<00:00, 550.37it/s]
100%|██████████| 3/3 [00:00<00:00, 4702.13it/s]
100%|██████████| 13/13 [00:00<00:00, 6480.38it/s]
100%|██████████| 10000/10000 [00:04<00:00, 2031.53it/s]
100%|██████████| 116/116 [00:00<00:00, 530.07it/s]
100%|██████████| 4/4 [00:00<00:00, 5181.35it/s]
100%|██████████| 16/16 [00:00<00:00, 6348.99it/s]
100%|██████████| 3/3 [00:18<00:00,  6.01s/it]

In [20]:

                
(df_true_melt_dataset_label_c_c,
 df_melt_subset_c_c, 
 df_melt_100resamples_subset_c_c,
 df_null_zscores_i_c_melt_subset_c_c,
 df_null_zscores_i_c_melt_100resamples_subset_c_c) = plot.multi_dataset_dfs_for_plotting(dfs_dataset_c, 
                                                                                        dataset_names, 
                                                                                        10000, 
                                                                                        subtree_dict,
                                                                                        cutoff=15,
                                                                                        num_null=1)

100%|██████████| 1/1 [00:00<00:00,  4.02it/s]
100%|██████████| 1/1 [00:00<00:00,  3.83it/s]
100%|██████████| 1/1 [00:00<00:00,  3.99it/s]

In [21]:

                
plot.multi_dataset_plot_deviation('triplet', 
                                  dataset_names,
                                  df_true_melt_dataset_label_c_c, 
                                  dataset_color_dict,
                                  cell_color_dict,
                                  cutoff=15, 
                                  title='auto',
                                  legend_bool=True,
                                  legend_pos='outside',
                                  save=False, 
                                  image_format='png',
                                  dpi=300,
                                  image_save_path=None)

Quartet motif analysis

In [22]:

                
(subtree_dict, 
 cell_fates, 
 dfs_dataset_c) = resample.multi_dataset_resample_trees(datasets, 
                                                        dataset_names,
                                                        'quartet',
                                                        num_resamples=10000, 
                                                        replacement_bool=True, 
                                                        )

100%|██████████| 10000/10000 [00:06<00:00, 1587.96it/s]
100%|██████████| 186/186 [00:00<00:00, 520.47it/s]
100%|██████████| 12/12 [00:00<00:00, 6710.89it/s]
100%|██████████| 10000/10000 [00:04<00:00, 2301.36it/s]
100%|██████████| 195/195 [00:00<00:00, 556.76it/s]
100%|██████████| 13/13 [00:00<00:00, 7209.57it/s]
100%|██████████| 10000/10000 [00:04<00:00, 2092.56it/s]
100%|██████████| 216/216 [00:00<00:00, 549.91it/s]
100%|██████████| 16/16 [00:00<00:00, 6935.60it/s]
100%|██████████| 3/3 [00:22<00:00,  7.50s/it]

In [23]:

                
(df_true_melt_dataset_label_c_c,
 df_melt_subset_c_c, 
 df_melt_100resamples_subset_c_c,
 df_null_zscores_i_c_melt_subset_c_c,
 df_null_zscores_i_c_melt_100resamples_subset_c_c) = plot.multi_dataset_dfs_for_plotting(dfs_dataset_c, 
                                                                                        dataset_names, 
                                                                                        10000, 
                                                                                        subtree_dict,
                                                                                        cutoff=15,
                                                                                        num_null=1)

100%|██████████| 1/1 [00:00<00:00,  1.34it/s]
100%|██████████| 1/1 [00:00<00:00,  1.26it/s]
100%|██████████| 1/1 [00:00<00:00,  1.38it/s]

In [24]:

                
plot.multi_dataset_plot_deviation('quartet', 
                                  dataset_names,
                                  df_true_melt_dataset_label_c_c, 
                                  dataset_color_dict,
                                  cell_color_dict,
                                  cutoff=15, 
                                  title='auto',
                                  legend_bool=True,
                                  legend_pos='outside',
                                  save=False, 
                                  image_format='png',
                                  dpi=300,
                                  image_save_path=None)