Reference

This part of the project documentation focuses on an information-oriented approach. Use it as a reference for the technical implementation of the linmo project code.

linmo.resample

Provides functions for resampling tree datasets.

This module contains the following functions:

  • sort_align_tree - Sorts and aligns trees.
  • read_dataset - Returns sorted tree dataset.
  • resample_trees_doublets - Returns subtree dictionary and DataFrame containing number of doublets across all resamples, the original trees, and the expected number (solved analytically).
  • resample_trees_triplets - Returns subtree dictionary and DataFrame containing number of triplets across all resamples, the original trees, and the expected number (solved analytically).
  • resample_trees_quartets - Returns subtree dictionary and DataFrame containing number of quartets across all resamples, the original trees, and the expected number (solved analytically).
  • resample_trees_asym_quartets - Returns subtree dictionary and DataFrame containing number of asymmetric quartets across all resamples, the original trees, and the expected number (solved analytically).
  • resample_trees_asym_quintets - Returns subtree dictionary and DataFrame containing number of asymmetric quintets across all resamples, the original trees, and the expected number (solved analytically).
  • resample_trees_asym_sextets - Returns subtree dictionary and DataFrame containing number of asymmetric sextets across all resamples, the original trees, and the expected number (solved analytically).
  • resample_trees_asym_septets - Returns subtree dictionary and DataFrame containing number of asymmetric septets across all resamples, the original trees, and the expected number (solved analytically).
  • resample_trees_sextets - Returns subtree dictionary and DataFrame containing number of sextets across all resamples, the original trees, and the expected number (solved analytically).
  • resample_trees_octets - Returns subtree dictionary and DataFrame containing number of octets across all resamples, the original trees, and the expected number (solved analytically).
  • multi_dataset_resample_trees - Returns subtree dictionary and DataFrame containing number of a defined subtree type across all resamples, the original trees, and the expected number (solved analytically) across all datasets.

multi_dataset_resample_trees(datasets, dataset_names, subtree, num_resamples=10000, replacement_bool=True, cell_fates='auto')

Performs resampling of trees, drawing with or without replacement, returning the number of subtrees across all resamples, the original trees, and the expected number (solved analytically) for multiple datasets. The cell fates used are the composite set across all datasets provided.

Resampling is done as described in each of the resample_trees_subtrees functions. If cell_fates is not explicitly provided, cell fates are determined automatically from the tree datasets.

Parameters:

Name Type Description Default
datasets list

List where each entry is a path to a txt file of a dataset. The txt file should be formatted as NEWICK trees separated by semi-colons with no spaces.

required
dataset_names list

List where each entry is a string representing the dataset label.

required
subtree string

Type of subtree to be analyzed. Should be 'doublet', 'triplet', or 'quartet'.

required
num_resamples int

Number of resample datasets.

10000
replacement_bool bool

Sample cells with or without replacement drawing from the pool of all cells.

True
cell_fates string or list

If 'auto' (i.e. not provided by user), automatically determined based on tree dataset. User can also provide list where each entry is a string representing a cell fate.

'auto'

Returns:

Type Description
tuple

Contains the following variables.

  • subtree_dict (dict): Keys are subtrees, values are integers.
  • cell_fates (list): List where each entry is a string representing a cell fate.
  • dfs_dataset_c (list): List where each entry is a DataFrame with the following characteristics. Indexed by values from subtree_dict. Last column is dataset label. Second to last column is analytically solved expected number of each subtree. Third to last column is observed number of occurrences in the original dataset. Rest of columns are the observed number of occurrences in the resampled sets.
Source code in linmo/resample.py
def multi_dataset_resample_trees(datasets,
                                 dataset_names,
                                 subtree,
                                 num_resamples=10000, 
                                 replacement_bool=True, 
                                 cell_fates='auto',
                                 ):
    """Performs resampling of trees, drawing with or without replacement, returning number of subtrees across
        all resamples, the original trees, and the expected number (solved analytically) 
        **for multiple datasets**. The cell fates used are the composite set across all datasets provided.

    Resampling is done as described in each of the `resample_trees_subtrees` functions.
    If `cell_fates` is not explicitly provided, cell fates are determined automatically from the tree datasets.

    Args:
        datasets (list): List where each entry is a path to txt file of dataset. 
            txt file should be formatted as NEWICK trees separated with semi-colons and no spaces
        dataset_names (list): List where each entry is a string representing the dataset label. 
        subtree (string): Type of subtree to be analyzed. Should be 'doublet', 'triplet', or 'quartet'.
        num_resamples (int, optional): Number of resample datasets.
        replacement_bool (bool, optional): Sample cells with or without replacement drawing from the pool of all cells.
        cell_fates (string or list, optional): If 'auto' (i.e. not provided by user), automatically determined 
            based on tree dataset. User can also provide list where each entry is a string representing a cell fate.

    Returns:
        (tuple): Contains the following variables.
        - subtree_dict (dict): Keys are subtrees, values are integers.
        - cell_fates (list): List where each entry is a string representing a cell fate.
        - dfs_dataset_c (list): List where each entry is a DataFrame with the following characteristics.
            Indexed by values from `subtree_dict`.
            Last column is dataset label.
            Second to last column is analytically solved expected number of each subtree.
            Third to last column is observed number of occurrences in the original dataset.
            Rest of columns are the observed number of occurrences in the resampled sets.


    """
    # automatically determine cell fates if not explicitly provided
    if cell_fates == 'auto':
        all_trees_sorted_list = []
        for dataset in datasets:
            all_trees_sorted = read_dataset(dataset)
            all_trees_sorted_list.append(all_trees_sorted)
        all_trees_sorted_list_flattened = [i for sublist in all_trees_sorted_list for i in sublist]
        cell_fates = sorted(list(np.unique(re.findall('[A-Z]', ''.join([i for sublist in all_trees_sorted_list_flattened for i in sublist])))))

    # _make_subtree_dict functions can only handle 10 cell fates max
    if len(cell_fates)>10:
        print('warning, _make_subtree_dict functions can only handle 10 cell fates max!')

    # next, resample each dataset using composite cell fates list
    dfs_dataset_list = []
    for index, dataset in enumerate(tqdm(datasets)):
        all_trees_sorted = read_dataset(dataset)
        if subtree == 'doublet':
            (subtree_dict, cell_fates, dfs_dataset) = resample_trees_doublets(all_trees_sorted, 
                                                          num_resamples, 
                                                          replacement_bool,
                                                          cell_fates=cell_fates
                                                          )
            dfs_dataset['dataset'] = dataset_names[index]

        elif subtree == 'triplet':
            (subtree_dict, cell_fates, dfs_dataset) = resample_trees_triplets(all_trees_sorted, 
                                                          num_resamples, 
                                                          replacement_bool,
                                                          cell_fates=cell_fates
                                                          )
            dfs_dataset['dataset'] = dataset_names[index]

        elif subtree == 'quartet':
            (subtree_dict, cell_fates, dfs_dataset) = resample_trees_quartets(all_trees_sorted, 
                                                          num_resamples, 
                                                          replacement_bool,
                                                          cell_fates=cell_fates
                                                          )
            dfs_dataset['dataset'] = dataset_names[index]

        dfs_dataset_list.append(dfs_dataset)
    dfs_dataset_c = pd.concat(dfs_dataset_list)
    return (subtree_dict, cell_fates, dfs_dataset_c)
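The labeling and concatenation step at the end of the loop can be illustrated with a standalone pandas sketch (hypothetical data, not linmo internals): each per-dataset DataFrame gets a 'dataset' column before all frames are stacked row-wise.

```python
import pandas as pd

# Hypothetical per-dataset results: column 0 is one resample,
# 'observed' and 'expected' follow, mirroring the dfs_dataset layout
df_a = pd.DataFrame({0: [3, 1], 'observed': [2, 2], 'expected': [2.5, 1.2]})
df_a['dataset'] = 'dataset_A'  # label added per dataset, as in the loop above
df_b = pd.DataFrame({0: [4, 0], 'observed': [5, 1], 'expected': [3.1, 0.8]})
df_b['dataset'] = 'dataset_B'

# pd.concat stacks the frames row-wise, keeping each frame's subtree index
dfs_dataset_c = pd.concat([df_a, df_b])
print(dfs_dataset_c['dataset'].tolist())
```

The dataset label in the last column is what lets downstream plotting code distinguish rows from different datasets after concatenation.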

read_dataset(path)

Reads dataset txt file located at path.

Parameters:

Name Type Description Default
path string

Path to a txt file of a dataset. The txt file should be formatted as NEWICK trees separated by semi-colons with no spaces.

required

Returns:

Name Type Description
all_trees_sorted list

List where each entry is a string representing a tree in NEWICK format. Trees are sorted using the sort_align_tree function.

Source code in linmo/resample.py
def read_dataset(path):
    """Reads dataset txt file located at `path`.

    Args:
        path (string): Path to txt file of dataset. txt file should be formatted as NEWICK trees 
            separated with semi-colons and no spaces.

    Returns:
        all_trees_sorted (list): List where each entry is a string representing a tree in NEWICK format. 
            Trees are sorted using the `sort_align_tree` function.
    """
    with open(path) as f:
        lines = f.readlines()

    all_trees_unsorted = lines[0].split(';')
    all_trees_sorted = [sort_align_tree(i) for i in all_trees_unsorted]
    return all_trees_sorted
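As a minimal sketch of the file format this function expects (hypothetical trees; the `sort_align_tree` pass is omitted here), a dataset is a single line of NEWICK trees separated by semi-colons:

```python
import os
import tempfile

# write a hypothetical two-tree dataset in the expected format:
# NEWICK trees separated by semi-colons, no spaces
content = '((A,B),C);((B,C),A)'
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write(content)
    path = f.name

# mirror of read_dataset's parsing step, without the sorting pass
with open(path) as f:
    lines = f.readlines()
all_trees_unsorted = lines[0].split(';')
os.unlink(path)
print(all_trees_unsorted)  # ['((A,B),C)', '((B,C),A)']
```

Note that because the split is on ';', a trailing semi-colon in the file would produce an empty final entry.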

resample_trees_asym_quartets(all_trees_sorted, num_resamples=10000, replacement_bool=True, cell_fates='auto', calc_expected=True)

Performs resampling of trees, drawing with or without replacement, returning subtree dictionary and DataFrame containing number of asymmetric quartets across all resamples, the original trees, and the expected number (solved analytically).

Resampling is done via (1) replacing each triplet with a randomly chosen triplet across all trees and (2) replacing every other cell with a randomly chosen non-triplet cell across all trees. If cell_fates is not explicitly provided, cell fates are determined automatically from the tree dataset.

Parameters:

Name Type Description Default
all_trees_sorted list

List where each entry is a string representing a tree in NEWICK format. Trees are sorted using the sort_align_tree function.

required
num_resamples int

Number of resample datasets.

10000
replacement_bool bool

Sample cells with or without replacement drawing from the pool of all cells.

True
cell_fates string or list

If 'auto' (i.e. not provided by user), automatically determined based on tree dataset. User can also provide list where each entry is a string representing a cell fate.

'auto'
calc_expected Boolean

If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

True

Returns:

Type Description
tuple

Contains the following variables.

  • asym_quartet_dict (dict): Keys are asym_quartets, values are integers.
  • cell_fates (list): List where each entry is a string representing a cell fate.
  • dfs_c (DataFrame): Indexed by values from asym_quartet_dict. Last column is analytically solved expected number of each asym_quartet. Second to last column is observed number of occurrences in the original dataset. Rest of columns are the observed number of occurrences in the resampled sets.
Source code in linmo/resample.py
def resample_trees_asym_quartets(all_trees_sorted, 
                            num_resamples=10000, 
                            replacement_bool=True,
                            cell_fates='auto', 
                            calc_expected=True
                           ):
    """Performs resampling of trees, drawing with or without replacement, returning subtree dictionary and DataFrame containing 
    number of asymmetric quartets across all resamples, the original trees, and the expected number (solved analytically).

    Resampling is done via (1) replacing each triplet with a randomly chosen triplet across all trees and 
    (2) replacing every other cell with a randomly chosen non-triplet cell across all trees.
    If `cell_fates` is not explicitly provided, cell fates are determined automatically from the tree dataset.

    Args:
        all_trees_sorted (list): List where each entry is a string representing a tree in NEWICK format. 
            Trees are sorted using the `sort_align_tree` function.
        num_resamples (int, optional): Number of resample datasets.
        replacement_bool (bool, optional): Sample cells with or without replacement drawing from the pool of all cells.
        cell_fates (string or list, optional): If 'auto' (i.e. not provided by user), automatically determined 
            based on tree dataset. User can also provide list where each entry is a string representing a cell fate.
        calc_expected (Boolean): If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

    Returns:
        (tuple): Contains the following variables.
        - asym_quartet_dict (dict): Keys are asym_quartets, values are integers.
        - cell_fates (list): List where each entry is a string representing a cell fate.
        - dfs_c (DataFrame): Indexed by values from `asym_quartet_dict`.
            Last column is analytically solved expected number of each asym_quartet.
            Second to last column is observed number of occurrences in the original dataset.
            Rest of columns are the observed number of occurrences in the resampled sets.
    """
    # automatically determine cell fates if not explicitly provided
    if cell_fates == 'auto':
        cell_fates = sorted(list(np.unique(re.findall('[A-Z]', ''.join([i for sublist in all_trees_sorted for i in sublist])))))

    # _make_subtree_dict functions can only handle 10 cell fates max
    if len(cell_fates)>10:
        print('warning!')

    asym_quartet_dict = _make_asym_quartet_dict(cell_fates)
    triplet_dict = _make_triplet_dict(cell_fates)
    cell_dict = _make_cell_dict(cell_fates)

    # store result for each rearrangement in dfs list
    dfs_asym_quartets_new = []
    df_asym_quartets_true = _make_df_asym_quartets(all_trees_sorted, asym_quartet_dict, 'observed', False)
    df_triplets_true = _make_df_triplets(all_trees_sorted, triplet_dict, 'observed', False)
    df_non_triplets_true = _make_df_non_triplets(all_trees_sorted, cell_dict, 'observed', False)

    # rearrange leaves num_resamples times
    for resample in tqdm(range(0, num_resamples)):
        triplets_true = _flatten_triplets(all_trees_sorted)
        non_triplets_true = _flatten_non_triplets(all_trees_sorted)

        # shuffle if replacement=False
        if replacement_bool==False:
            random.shuffle(triplets_true)
            random.shuffle(non_triplets_true)

        # first, replace each triplet with a symbol
        new_trees_1 = [_replace_triplets_symbol(i) for i in all_trees_sorted]
        # then, replace all other cells 
        new_trees_2 = [_replace_all(i, non_triplets_true, replacement_bool) for i in new_trees_1]
        # then, replace the symbols
        new_trees_3 = [_replace_symbols(i, triplets_true, replacement_bool) for i in new_trees_2]
        df_asym_quartets_new = _make_df_asym_quartets(new_trees_3, asym_quartet_dict, resample, False)
        dfs_asym_quartets_new.append(df_asym_quartets_new)

    dfs_c = _process_dfs_asym_quartet(df_asym_quartets_true, dfs_asym_quartets_new, num_resamples, asym_quartet_dict, triplet_dict, cell_dict, df_triplets_true, df_non_triplets_true, calc_expected)

    return (asym_quartet_dict, cell_fates, dfs_c)
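The `replacement_bool` switch can be illustrated with a standalone sketch (plain Python, hypothetical pool, not linmo internals): drawing with replacement samples each slot independently from the pool, while drawing without replacement consumes a shuffled copy of the pool, i.e. produces a permutation of it.

```python
import random

random.seed(0)  # for reproducibility of this sketch
pool = ['A', 'A', 'B', 'C']  # hypothetical pool of motifs (or cells)

# with replacement: each slot is an independent draw, so a motif can be
# drawn more often than it appears in the pool
with_replacement = [random.choice(pool) for _ in range(len(pool))]

# without replacement: shuffle once and consume in order, so the result
# is exactly a permutation of the pool (multiplicities preserved)
without_replacement = pool.copy()
random.shuffle(without_replacement)

print(sorted(without_replacement) == sorted(pool))  # True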

resample_trees_asym_quintets(all_trees_sorted, num_resamples=10000, replacement_bool=True, cell_fates='auto', calc_expected=True)

Performs resampling of trees, drawing with or without replacement, returning subtree dictionary and DataFrame containing number of asymmetric quintets across all resamples, the original trees, and the expected number (solved analytically).

Resampling is done via (1) replacing each asymmetric quartet with a randomly chosen asymmetric quartet across all trees and (2) replacing every other cell with a randomly chosen non-asymmetric quartet cell across all trees. If cell_fates is not explicitly provided, cell fates are determined automatically from the tree dataset.

Parameters:

Name Type Description Default
all_trees_sorted list

List where each entry is a string representing a tree in NEWICK format. Trees are sorted using the sort_align_tree function.

required
num_resamples int

Number of resample datasets.

10000
replacement_bool bool

Sample cells with or without replacement drawing from the pool of all cells.

True
cell_fates string or list

If 'auto' (i.e. not provided by user), automatically determined based on tree dataset. User can also provide list where each entry is a string representing a cell fate.

'auto'
calc_expected Boolean

If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

True

Returns:

Type Description
tuple

Contains the following variables.

  • asym_quintet_dict (dict): Keys are asym_quintets, values are integers.
  • cell_fates (list): List where each entry is a string representing a cell fate.
  • dfs_c (DataFrame): Indexed by values from asym_quintet_dict. Last column is analytically solved expected number of each asym_quintet. Second to last column is observed number of occurrences in the original dataset. Rest of columns are the observed number of occurrences in the resampled sets.
Source code in linmo/resample.py
def resample_trees_asym_quintets(all_trees_sorted, 
                            num_resamples=10000, 
                            replacement_bool=True,
                            cell_fates='auto', 
                            calc_expected=True
                           ):
    """Performs resampling of trees, drawing with or without replacement, returning subtree dictionary and DataFrame containing 
    number of asymmetric quintets across all resamples, the original trees, and the expected number (solved analytically).

    Resampling is done via (1) replacing each asymmetric quartet with a randomly chosen asymmetric quartet across all trees and 
    (2) replacing every other cell with a randomly chosen non-asymmetric quartet cell across all trees.
    If `cell_fates` is not explicitly provided, cell fates are determined automatically from the tree dataset.

    Args:
        all_trees_sorted (list): List where each entry is a string representing a tree in NEWICK format. 
            Trees are sorted using the `sort_align_tree` function.
        num_resamples (int, optional): Number of resample datasets.
        replacement_bool (bool, optional): Sample cells with or without replacement drawing from the pool of all cells.
        cell_fates (string or list, optional): If 'auto' (i.e. not provided by user), automatically determined 
            based on tree dataset. User can also provide list where each entry is a string representing a cell fate.
        calc_expected (Boolean): If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

    Returns:
        (tuple): Contains the following variables.
        - asym_quintet_dict (dict): Keys are asym_quintets, values are integers.
        - cell_fates (list): List where each entry is a string representing a cell fate.
        - dfs_c (DataFrame): Indexed by values from `asym_quintet_dict`.
            Last column is analytically solved expected number of each asym_quintet.
            Second to last column is observed number of occurrences in the original dataset.
            Rest of columns are the observed number of occurrences in the resampled sets.
    """
    # automatically determine cell fates if not explicitly provided
    if cell_fates == 'auto':
        cell_fates = sorted(list(np.unique(re.findall('[A-Z]', ''.join([i for sublist in all_trees_sorted for i in sublist])))))

    # _make_subtree_dict functions can only handle 10 cell fates max
    if len(cell_fates)>10:
        print('warning, _make_subtree_dict functions can only handle 10 cell fates max!')

    asym_quintet_dict = _make_asym_quintet_dict(cell_fates)
    asym_quartet_dict = _make_asym_quartet_dict(cell_fates)
    cell_dict = _make_cell_dict(cell_fates)

    # store result for each rearrangement in dfs list
    dfs_asym_quintets_new = []
    df_asym_quintets_true = _make_df_asym_quintets(all_trees_sorted, asym_quintet_dict, 'observed', False)
    df_asym_quartets_true = _make_df_asym_quartets(all_trees_sorted, asym_quartet_dict, 'observed', False)
    df_non_asym_quartets_true = _make_df_non_asym_quartets(all_trees_sorted, cell_dict, 'observed', False)

    # rearrange leaves num_resamples times
    for resample in tqdm(range(0, num_resamples)):
        asym_quartets_true = _flatten_asym_quartets(all_trees_sorted)
        non_asym_quartets_true = _flatten_non_asym_quartets(all_trees_sorted)

        # shuffle if replacement=False
        if replacement_bool==False:
            random.shuffle(asym_quartets_true)
            random.shuffle(non_asym_quartets_true)

        # first, replace each asymmetric quartet with a symbol
        new_trees_1 = [_replace_asym_quartets_symbol(i) for i in all_trees_sorted]
        # then, replace all other cells 
        new_trees_2 = [_replace_all(i, non_asym_quartets_true, replacement_bool) for i in new_trees_1]
        # then, replace the symbols
        new_trees_3 = [_replace_symbols(i, asym_quartets_true, replacement_bool) for i in new_trees_2]
        df_asym_quintets_new = _make_df_asym_quintets(new_trees_3, asym_quintet_dict, resample, False)
        dfs_asym_quintets_new.append(df_asym_quintets_new)

    dfs_c = _process_dfs_asym_quintet(df_asym_quintets_true, dfs_asym_quintets_new, num_resamples, asym_quintet_dict, asym_quartet_dict, cell_dict, df_asym_quartets_true, df_non_asym_quartets_true, calc_expected)

    return (asym_quintet_dict, cell_fates, dfs_c)
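The 'auto' branch at the top of each function can be reproduced standalone (hypothetical trees; a stdlib `set` is used here in place of the library's `np.unique`, which gives the same result): capital letters are collected across all trees, deduplicated, and sorted.

```python
import re

# hypothetical sorted trees with three cell fates A, B, C
all_trees_sorted = ['((A,B),(C,A));', '((B,B),C);']

# same idea as the library's 'auto' branch: find all capital letters,
# deduplicate, and sort alphabetically
cell_fates = sorted(set(re.findall('[A-Z]', ''.join(all_trees_sorted))))
print(cell_fates)  # ['A', 'B', 'C']
```

Because cell fates are single capital letters, this also shows why the dictionaries cap out at 10 fates: the internal `_make_subtree_dict` encodings assume at most 10 distinct symbols.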

resample_trees_asym_septets(all_trees_sorted, num_resamples=10000, replacement_bool=True, cell_fates='auto', calc_expected=True)

Performs resampling of trees, drawing with or without replacement, returning subtree dictionary and DataFrame containing number of asymmetric septets across all resamples, the original trees, and the expected number (solved analytically).

Resampling is done via (1) replacing each asymmetric sextet with a randomly chosen asymmetric sextet across all trees and (2) replacing every other cell with a randomly chosen non-asymmetric sextet cell across all trees. If cell_fates is not explicitly provided, cell fates are determined automatically from the tree dataset.

Parameters:

Name Type Description Default
all_trees_sorted list

List where each entry is a string representing a tree in NEWICK format. Trees are sorted using the sort_align_tree function.

required
num_resamples int

Number of resample datasets.

10000
replacement_bool bool

Sample cells with or without replacement drawing from the pool of all cells.

True
cell_fates string or list

If 'auto' (i.e. not provided by user), automatically determined based on tree dataset. User can also provide list where each entry is a string representing a cell fate.

'auto'
calc_expected Boolean

If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

True

Returns:

Type Description
tuple

Contains the following variables.

  • asym_septet_dict (dict): Keys are asym_septets, values are integers.
  • cell_fates (list): List where each entry is a string representing a cell fate.
  • dfs_c (DataFrame): Indexed by values from asym_septet_dict. Last column is analytically solved expected number of each asym_septet. Second to last column is observed number of occurrences in the original dataset. Rest of columns are the observed number of occurrences in the resampled sets.
Source code in linmo/resample.py
def resample_trees_asym_septets(all_trees_sorted, 
                            num_resamples=10000, 
                            replacement_bool=True,
                            cell_fates='auto',
                            calc_expected=True
                           ):
    """Performs resampling of trees, drawing with or without replacement, returning subtree dictionary and DataFrame containing 
    number of asymmetric septets across all resamples, the original trees, and the expected number (solved analytically).

    Resampling is done via (1) replacing each asymmetric sextet with a randomly chosen asymmetric sextet across all trees and 
    (2) replacing every other cell with a randomly chosen non-asymmetric sextet cell across all trees.
    If `cell_fates` is not explicitly provided, cell fates are determined automatically from the tree dataset.

    Args:
        all_trees_sorted (list): List where each entry is a string representing a tree in NEWICK format. 
            Trees are sorted using the `sort_align_tree` function.
        num_resamples (int, optional): Number of resample datasets.
        replacement_bool (bool, optional): Sample cells with or without replacement drawing from the pool of all cells.
        cell_fates (string or list, optional): If 'auto' (i.e. not provided by user), automatically determined 
            based on tree dataset. User can also provide list where each entry is a string representing a cell fate.
        calc_expected (Boolean): If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

    Returns:
        (tuple): Contains the following variables.
        - asym_septet_dict (dict): Keys are asym_septets, values are integers.
        - cell_fates (list): List where each entry is a string representing a cell fate.
        - dfs_c (DataFrame): Indexed by values from `asym_septet_dict`.
            Last column is analytically solved expected number of each asym_septet.
            Second to last column is observed number of occurrences in the original dataset.
            Rest of columns are the observed number of occurrences in the resampled sets.
    """
    # automatically determine cell fates if not explicitly provided
    if cell_fates == 'auto':
        cell_fates = sorted(list(np.unique(re.findall('[A-Z]', ''.join([i for sublist in all_trees_sorted for i in sublist])))))

    # _make_subtree_dict functions can only handle 10 cell fates max
    if len(cell_fates)>10:
        print('warning, _make_subtree_dict functions can only handle 10 cell fates max!')

    asym_septet_dict = _make_asym_septet_dict(cell_fates)
    asym_sextet_dict = _make_asym_sextet_dict(cell_fates)
    cell_dict = _make_cell_dict(cell_fates)

    # store result for each rearrangement in dfs list
    dfs_asym_septets_new = []
    df_asym_septets_true = _make_df_asym_septets(all_trees_sorted, asym_septet_dict, 'observed', False)
    df_asym_sextets_true = _make_df_asym_sextets(all_trees_sorted, asym_sextet_dict, 'observed', False)
    df_non_asym_sextets_true = _make_df_non_asym_sextets(all_trees_sorted, cell_dict, 'observed', False)

    # rearrange leaves num_resamples times
    for resample in tqdm(range(0, num_resamples)):
        asym_sextets_true = _flatten_asym_sextets(all_trees_sorted)
        non_asym_sextets_true = _flatten_non_asym_sextets(all_trees_sorted)

        # shuffle if replacement=False
        if replacement_bool==False:
            random.shuffle(asym_sextets_true)
            random.shuffle(non_asym_sextets_true)

        # first, replace each asymmetric sextet with a symbol
        new_trees_1 = [_replace_asym_sextets_symbol(i) for i in all_trees_sorted]
        # then, replace all other cells 
        new_trees_2 = [_replace_all(i, non_asym_sextets_true, replacement_bool) for i in new_trees_1]
        # then, replace the symbols
        new_trees_3 = [_replace_symbols(i, asym_sextets_true, replacement_bool) for i in new_trees_2]
        df_asym_septets_new = _make_df_asym_septets(new_trees_3, asym_septet_dict, resample, False)
        dfs_asym_septets_new.append(df_asym_septets_new)

    dfs_c = _process_dfs_asym_septet(df_asym_septets_true, dfs_asym_septets_new, num_resamples, asym_septet_dict, asym_sextet_dict, cell_dict, df_asym_sextets_true, df_non_asym_sextets_true, calc_expected)

    return (asym_septet_dict, cell_fates, dfs_c)
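When calc_expected=True, the expected count is the product of the sub-pattern marginal probabilities times the total number of subtrees. A toy worked example with hypothetical numbers, using a doublet for concreteness (the same logic extends to the larger subtree types):

```python
# hypothetical marginals: fractions of each fate among all cells
p_A = 0.4  # 4 of 10 cells are fate A
p_B = 0.6  # 6 of 10 cells are fate B
num_subtrees = 50  # total number of doublets observed across all trees

# expected count of the unordered doublet (A,B): either child may be A,
# so the pattern probability is 2 * p_A * p_B
expected_AB = 2 * p_A * p_B * num_subtrees
print(expected_AB)  # 24.0
```

Comparing this analytic expectation against the distribution of counts over the resampled datasets is what lets over- or under-represented subtree motifs be identified.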

resample_trees_asym_sextets(all_trees_sorted, num_resamples=10000, replacement_bool=True, cell_fates='auto', calc_expected=True)

Performs resampling of trees, drawing with or without replacement, returning subtree dictionary and DataFrame containing number of asymmetric sextets across all resamples, the original trees, and the expected number (solved analytically).

Resampling is done via (1) replacing each asymmetric quintet with a randomly chosen asymmetric quintet across all trees and (2) replacing every other cell with a randomly chosen non-asymmetric quintet cell across all trees. If cell_fates is not explicitly provided, cell fates are determined automatically from the tree dataset.

Parameters:

Name Type Description Default
all_trees_sorted list

List where each entry is a string representing a tree in NEWICK format. Trees are sorted using the sort_align_tree function.

required
num_resamples int

Number of resample datasets.

10000
replacement_bool bool

Sample cells with or without replacement drawing from the pool of all cells.

True
cell_fates string or list

If 'auto' (i.e. not provided by user), automatically determined based on tree dataset. User can also provide list where each entry is a string representing a cell fate.

'auto'
calc_expected Boolean

If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

True

Returns:

Type Description
tuple

Contains the following variables.

  • asym_sextet_dict (dict): Keys are asym_sextets, values are integers.
  • cell_fates (list): List where each entry is a string representing a cell fate.
  • dfs_c (DataFrame): Indexed by values from asym_sextet_dict. Last column is analytically solved expected number of each asym_sextet. Second to last column is observed number of occurrences in the original dataset. Rest of columns are the observed number of occurrences in the resampled sets.
Source code in linmo/resample.py
def resample_trees_asym_sextets(all_trees_sorted, 
                            num_resamples=10000, 
                            replacement_bool=True,
                            cell_fates='auto', 
                            calc_expected=True
                           ):
    """Performs resampling of trees, drawing with or without replacement, returning subtree dictionary and DataFrame containing 
    number of asymmetric sextets across all resamples, the original trees, and the expected number (solved analytically).

    Resampling is done via (1) replacing each asymmetric quintet with a randomly chosen asymmetric quintet across all trees and 
    (2) replacing every other cell with a randomly chosen non-asymmetric quintet cell across all trees.
    If `cell_fates` not explicitly provided, use automatically determined cell fates based on tree dataset.

    Args:
        all_trees_sorted (list): List where each entry is a string representing a tree in NEWICK format. 
            Trees are sorted using the `sort_align_tree` function.
        num_resamples (int, optional): Number of resample datasets.
        replacement_bool (bool, optional): Sample cells with or without replacement drawing from the pool of all cells.
        cell_fates (string or list, optional): If 'auto' (i.e. not provided by user), automatically determined 
            based on tree dataset. User can also provide list where each entry is a string representing a cell fate.
        calc_expected (bool, optional): If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

    Returns:
        (tuple): Contains the following variables.
        - asym_sextet_dict (dict): Keys are asym_sextets, values are integers.
        - cell_fates (list): List where each entry is a string representing a cell fate.
        - dfs_c (DataFrame): Indexed by values from `asym_sextet_dict`.
            Last column is analytically solved expected number of each asym_sextet.
            Second to last column is observed number of occurrences in the original dataset.
            Rest of columns are the observed number of occurrences in the resampled sets.
    """
    # automatically determine cell fates if not explicitly provided
    if cell_fates == 'auto':
        cell_fates = sorted(list(np.unique(re.findall('[A-Z]', ''.join([i for sublist in all_trees_sorted for i in sublist])))))

    # _make_subtree_dict functions can only handle 10 cell fates max
    if len(cell_fates)>10:
        print('warning, _make_subtree_dict functions can only handle 10 cell fates max!')

    asym_sextet_dict = _make_asym_sextet_dict(cell_fates)
    asym_quintet_dict = _make_asym_quintet_dict(cell_fates)
    cell_dict = _make_cell_dict(cell_fates)

    # store result for each rearrangement in dfs list
    dfs_asym_sextets_new = []
    df_asym_sextets_true = _make_df_asym_sextets(all_trees_sorted, asym_sextet_dict, 'observed', False)
    df_asym_quintets_true = _make_df_asym_quintets(all_trees_sorted, asym_quintet_dict, 'observed', False)
    df_non_asym_quintets_true = _make_df_non_asym_quintets(all_trees_sorted, cell_dict, 'observed', False)

    # rearrange leaves num_resamples times
    for resample in tqdm(range(0, num_resamples)):
        asym_quintets_true = _flatten_asym_quintets(all_trees_sorted)
        non_asym_quintets_true = _flatten_non_asym_quintets(all_trees_sorted)

        # shuffle if replacement=False
        if replacement_bool==False:
            random.shuffle(asym_quintets_true)
            random.shuffle(non_asym_quintets_true)

        # first, replace each asymmetric quintet with a symbol
        new_trees_1 = [_replace_asym_quintets_symbol(i) for i in all_trees_sorted]
        # then, replace all other cells 
        new_trees_2 = [_replace_all(i, non_asym_quintets_true, replacement_bool) for i in new_trees_1]
        # then, replace the symbols
        new_trees_3 = [_replace_symbols(i, asym_quintets_true, replacement_bool) for i in new_trees_2]
        df_asym_sextets_new = _make_df_asym_sextets(new_trees_3, asym_sextet_dict, resample, False)
        dfs_asym_sextets_new.append(df_asym_sextets_new)

    dfs_c = _process_dfs_asym_sextet(df_asym_sextets_true, dfs_asym_sextets_new, num_resamples, asym_sextet_dict, asym_quintet_dict, cell_dict, df_asym_quintets_true, df_non_asym_quintets_true, calc_expected)

    return (asym_sextet_dict, cell_fates, dfs_c)

resample_trees_doublets(all_trees_sorted, num_resamples=10000, replacement_bool=True, cell_fates='auto', calc_expected=True)

Performs resampling of trees, drawing with or without replacement, returning the subtree dictionary and a DataFrame containing the number of doublets across all resamples, the original trees, and the expected number (solved analytically).

Resampling is done by replacing each cell fate with a randomly chosen cell fate across all trees. If cell_fates is not explicitly provided, cell fates are determined automatically from the tree dataset.

Parameters:

  • all_trees_sorted (list, required): List where each entry is a string representing a tree in NEWICK format. Trees are sorted using the sort_align_tree function.
  • num_resamples (int, default 10000): Number of resample datasets.
  • replacement_bool (bool, default True): Sample cells with or without replacement, drawing from the pool of all cells.
  • cell_fates (string or list, default 'auto'): If 'auto' (i.e. not provided by the user), determined automatically from the tree dataset. The user can also provide a list where each entry is a string representing a cell fate.
  • calc_expected (bool, default True): If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

Returns:

tuple: Contains the following variables.

  • doublet_dict (dict): Keys are doublets, values are integers.
  • cell_fates (list): List where each entry is a string representing a cell fate.
  • dfs_c (DataFrame): Indexed by values from doublet_dict. The last column is the analytically solved expected number of each doublet. The second-to-last column is the observed number of occurrences in the original dataset. The remaining columns are the observed numbers of occurrences in the resampled sets.
Source code in linmo/resample.py
def resample_trees_doublets(all_trees_sorted, 
                            num_resamples=10000, 
                            replacement_bool=True, 
                            cell_fates='auto', 
                            calc_expected=True
                            ):
    """Performs resampling of trees, drawing with or without replacement, returning subtree dictionary and DataFrame containing
    number of doublets across all resamples, the original trees, and the expected number (solved analytically).

    Resampling is done by replacing each cell fate with a randomly chosen cell fate across all trees.
    If `cell_fates` not explicitly provided, use automatically determined cell fates based on tree dataset.


    Args:
        all_trees_sorted (list): List where each entry is a string representing a tree in NEWICK format. 
            Trees are sorted using the `sort_align_tree` function.
        num_resamples (int, optional): Number of resample datasets.
        replacement_bool (bool, optional): Sample cells with or without replacement drawing from the pool of all cells.
        cell_fates (string or list, optional): If 'auto' (i.e. not provided by user), automatically determined 
            based on tree dataset. User can also provide list where each entry is a string representing a cell fate.
        calc_expected (bool, optional): If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

    Returns:
        (tuple): Contains the following variables.
        - doublet_dict (dict): Keys are doublets, values are integers.
        - cell_fates (list): List where each entry is a string representing a cell fate.
        - dfs_c (DataFrame): Indexed by values from `doublet_dict`.
            Last column is analytically solved expected number of each doublet.
            Second to last column is observed number of occurrences in the original dataset.
            Rest of columns are the observed number of occurrences in the resampled sets.


    """
    # automatically determine cell fates if not explicitly provided
    if cell_fates == 'auto':
        cell_fates = sorted(list(np.unique(re.findall('[A-Z]', ''.join([i for sublist in all_trees_sorted for i in sublist])))))

    # _make_subtree_dict functions can only handle 10 cell fates max
    if len(cell_fates)>10:
        print('warning, _make_subtree_dict functions can only handle 10 cell fates max!')

    doublet_dict = _make_doublet_dict(cell_fates)
    cell_dict = _make_cell_dict(cell_fates)

    # store result for each rearrangement in dfs list
    dfs_doublets_new = []
    df_doublets_true = _make_df_doublets(all_trees_sorted, doublet_dict, 'observed', False)
    df_all_cells_true = _make_df_all_cells(all_trees_sorted, cell_dict, 'observed', False)

    # rearrange leaves num_resamples times
    for resample in tqdm(range(0, num_resamples)):
        all_cells_true = _flatten_all_cells(all_trees_sorted)

        # shuffle if replacement=False
        if replacement_bool==False:
            random.shuffle(all_cells_true)

        new_trees = [_replace_all(i, all_cells_true, replacement_bool) for i in all_trees_sorted]
        df_doublets_new = _make_df_doublets(new_trees, doublet_dict, resample, False)
        dfs_doublets_new.append(df_doublets_new)

    dfs_c = _process_dfs_doublet(df_doublets_true, dfs_doublets_new, num_resamples, doublet_dict, cell_dict, df_all_cells_true, calc_expected)

    return (doublet_dict, cell_fates, dfs_c)
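The cell-fate auto-detection and the cell-level resampling performed above can be sketched in plain Python. This is a toy illustration of the idea, not linmo's internal code; it assumes, as the regex in the source does, that leaf fates are single uppercase letters:

```python
import random
import re

# Toy sketch of resample_trees_doublets' core idea: pool every leaf fate
# across all trees, then rebuild each tree by drawing from that pool.
trees = ["(A,B)", "(C,(A,A))"]

# cell fate auto-detection, mirroring the cell_fates='auto' behaviour
cell_fates = sorted(set(re.findall('[A-Z]', ''.join(trees))))

# the pool keeps multiplicities, so common fates are drawn more often
pool = re.findall('[A-Z]', ''.join(trees))

def resample_tree(tree, pool):
    # replace each leaf with an independent draw from the pool (with replacement)
    return re.sub('[A-Z]', lambda m: random.choice(pool), tree)

random.seed(0)
new_trees = [resample_tree(t, pool) for t in trees]
```

Drawing without replacement would instead shuffle the pool once and consume it position by position, which is the behaviour `replacement_bool=False` toggles.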

resample_trees_octets(all_trees_sorted, num_resamples=10000, replacement_bool=True, cell_fates='auto', calc_expected=True)

Performs resampling of trees, drawing with or without replacement, returning the subtree dictionary and a DataFrame containing the number of octets across all resamples, the original trees, and the expected number (solved analytically).

Resampling is done by replacing each quartet with a randomly chosen quartet across all trees. If cell_fates is not explicitly provided, cell fates are determined automatically from the tree dataset.

Parameters:

  • all_trees_sorted (list, required): List where each entry is a string representing a tree in NEWICK format. Trees are sorted using the sort_align_tree function.
  • num_resamples (int, default 10000): Number of resample datasets.
  • replacement_bool (bool, default True): Sample cells with or without replacement, drawing from the pool of all cells.
  • cell_fates (string or list, default 'auto'): If 'auto' (i.e. not provided by the user), determined automatically from the tree dataset. The user can also provide a list where each entry is a string representing a cell fate.
  • calc_expected (bool, default True): If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

Returns:

tuple: Contains the following variables.

  • octet_dict (dict): Keys are octets, values are integers.
  • cell_fates (list): List where each entry is a string representing a cell fate.
  • dfs_c (DataFrame): Indexed by values from octet_dict. The last column is the analytically solved expected number of each octet. The second-to-last column is the observed number of occurrences in the original dataset. The remaining columns are the observed numbers of occurrences in the resampled sets.
Source code in linmo/resample.py
def resample_trees_octets(all_trees_sorted, 
                            num_resamples=10000, 
                            replacement_bool=True,
                            cell_fates='auto', 
                            calc_expected=True
                           ):
    """Performs resampling of tree, drawing with or without replacement, returning subtree dictionary and DataFrame containing 
    the number of octets across all resamples, the original trees, and the expected number (solved analytically).

    Resampling is done via replacing each quartet with a randomly chosen quartet from across all trees.
    If `cell_fates` not explicitly provided, use automatically determined cell fates based on tree dataset.

    Args:
        all_trees_sorted (list): List where each entry is a string representing a tree in NEWICK format. 
            Trees are sorted using the `sort_align_tree` function.
        num_resamples (int, optional): Number of resample datasets.
        replacement_bool (bool, optional): Sample cells with or without replacement drawing from the pool of all cells.
        cell_fates (string or list, optional): If 'auto' (i.e. not provided by user), automatically determined 
            based on tree dataset. User can also provide list where each entry is a string representing a cell fate.
        calc_expected (bool, optional): If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

    Returns:
        (tuple): Contains the following variables.
        - octet_dict (dict): Keys are octets, values are integers.
        - cell_fates (list): List where each entry is a string representing a cell fate.
        - dfs_c (DataFrame): Indexed by values from `octet_dict`.
            Last column is analytically solved expected number of each octet.
            Second to last column is observed number of occurrences in the original dataset.
            Rest of columns are the observed number of occurrences in the resampled sets.


    """
    # automatically determine cell fates if not explicitly provided
    if cell_fates == 'auto':
        cell_fates = sorted(list(np.unique(re.findall('[A-Z]', ''.join([i for sublist in all_trees_sorted for i in sublist])))))

    # _make_subtree_dict functions can only handle 10 cell fates max
    if len(cell_fates)>10:
        print('warning, _make_subtree_dict functions can only handle 10 cell fates max!')

    octet_dict = _make_octet_dict(cell_fates)
    quartet_dict = _make_quartet_dict(cell_fates)

    # store result for each rearrangement in dfs list
    dfs_octets_new = []
    df_octets_true = _make_df_octets(all_trees_sorted, octet_dict, 'observed', False)
    df_quartets_true = _make_df_quartets(all_trees_sorted, quartet_dict, 'observed', False)

    # rearrange leaves num_resamples times
    for resample in tqdm(range(0, num_resamples)):
        quartets_true = _flatten_quartets(all_trees_sorted)

        # shuffle if replacement=False
        if replacement_bool==False:
            random.shuffle(quartets_true)

        new_trees = [_replace_quartets(i, quartets_true, replacement_bool) for i in all_trees_sorted]
        df_octets_new = _make_df_octets(new_trees, octet_dict, resample, False)
        dfs_octets_new.append(df_octets_new)

    dfs_c = _process_dfs_octet(df_octets_true, dfs_octets_new, num_resamples, octet_dict, quartet_dict, df_quartets_true, calc_expected)

    return (octet_dict, cell_fates, dfs_c)

resample_trees_quartets(all_trees_sorted, num_resamples=10000, replacement_bool=True, cell_fates='auto', calc_expected=True)

Performs resampling of trees, drawing with or without replacement, returning the subtree dictionary and a DataFrame containing the number of quartets across all resamples, the original trees, and the expected number (solved analytically).

Resampling is done by replacing each doublet with a randomly chosen doublet across all trees. If cell_fates is not explicitly provided, cell fates are determined automatically from the tree dataset.

Parameters:

  • all_trees_sorted (list, required): List where each entry is a string representing a tree in NEWICK format. Trees are sorted using the sort_align_tree function.
  • num_resamples (int, default 10000): Number of resample datasets.
  • replacement_bool (bool, default True): Sample cells with or without replacement, drawing from the pool of all cells.
  • cell_fates (string or list, default 'auto'): If 'auto' (i.e. not provided by the user), determined automatically from the tree dataset. The user can also provide a list where each entry is a string representing a cell fate.
  • calc_expected (bool, default True): If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

Returns:

tuple: Contains the following variables.

  • quartet_dict (dict): Keys are quartets, values are integers.
  • cell_fates (list): List where each entry is a string representing a cell fate.
  • dfs_c (DataFrame): Indexed by values from quartet_dict. The last column is the analytically solved expected number of each quartet. The second-to-last column is the observed number of occurrences in the original dataset. The remaining columns are the observed numbers of occurrences in the resampled sets.
Source code in linmo/resample.py
def resample_trees_quartets(all_trees_sorted, 
                            num_resamples=10000, 
                            replacement_bool=True,
                            cell_fates='auto', 
                            calc_expected=True
                           ):
    """Performs resampling of tree, drawing with or without replacement, returning subtree dictionary and DataFrame containing 
    the number of quartets across all resamples, the original trees, and the expected number (solved analytically).

    Resampling is done via replacing each doublet with a randomly chosen doublet from across all trees.
    If `cell_fates` not explicitly provided, use automatically determined cell fates based on tree dataset.

    Args:
        all_trees_sorted (list): List where each entry is a string representing a tree in NEWICK format. 
            Trees are sorted using the `sort_align_tree` function.
        num_resamples (int, optional): Number of resample datasets.
        replacement_bool (bool, optional): Sample cells with or without replacement drawing from the pool of all cells.
        cell_fates (string or list, optional): If 'auto' (i.e. not provided by user), automatically determined 
            based on tree dataset. User can also provide list where each entry is a string representing a cell fate.
        calc_expected (bool, optional): If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

    Returns:
        (tuple): Contains the following variables.
        - quartet_dict (dict): Keys are quartets, values are integers.
        - cell_fates (list): List where each entry is a string representing a cell fate.
        - dfs_c (DataFrame): Indexed by values from `quartet_dict`.
            Last column is analytically solved expected number of each quartet.
            Second to last column is observed number of occurrences in the original dataset.
            Rest of columns are the observed number of occurrences in the resampled sets.


    """
    # automatically determine cell fates if not explicitly provided
    if cell_fates == 'auto':
        cell_fates = sorted(list(np.unique(re.findall('[A-Z]', ''.join([i for sublist in all_trees_sorted for i in sublist])))))

    # _make_subtree_dict functions can only handle 10 cell fates max
    if len(cell_fates)>10:
        print('warning, _make_subtree_dict functions can only handle 10 cell fates max!')

    quartet_dict = _make_quartet_dict(cell_fates)
    doublet_dict = _make_doublet_dict(cell_fates)

    # store result for each rearrangement in dfs list
    dfs_quartets_new = []
    df_quartets_true = _make_df_quartets(all_trees_sorted, quartet_dict, 'observed', False)
    df_doublets_true = _make_df_doublets(all_trees_sorted, doublet_dict, 'observed', False)

    # rearrange leaves num_resamples times
    for resample in tqdm(range(0, num_resamples)):
        doublets_true = _flatten_doublets(all_trees_sorted)

        # shuffle if replacement=False
        if replacement_bool==False:
            random.shuffle(doublets_true)

        new_trees = [_replace_doublets(i, doublets_true, replacement_bool) for i in all_trees_sorted]
        df_quartets_new = _make_df_quartets(new_trees, quartet_dict, resample, False)
        dfs_quartets_new.append(df_quartets_new)

    dfs_c = _process_dfs_quartet(df_quartets_true, dfs_quartets_new, num_resamples, quartet_dict, doublet_dict, df_doublets_true, calc_expected)

    return (quartet_dict, cell_fates, dfs_c)
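The calc_expected rule ("multiply the marginal probabilities of each sub-pattern by the total number of subtrees") can be made concrete with a small worked example at the doublet level. This is a hedged sketch of the stated rule, not linmo's exact implementation; in particular, the factor of 2 for distinct fates is our assumption that an unordered pair can arise in either left-right order:

```python
from collections import Counter

# pooled leaf fates across the dataset, and the total number of doublets
cells = ['A', 'A', 'B', 'B', 'B', 'C']
num_doublets = 3

# marginal probability of each cell fate
counts = Counter(cells)
total = sum(counts.values())
p = {fate: n / total for fate, n in counts.items()}

def expected_doublets(x, y):
    # unordered pair (x, y): two distinct fates can occur in either order
    prob = p[x] * p[y] * (2 if x != y else 1)
    return num_doublets * prob

exp_AB = expected_doublets('A', 'B')   # 3 * 2 * (2/6) * (3/6) = 1.0
exp_AA = expected_doublets('A', 'A')   # 3 * (2/6)**2
```

The same multiplication extends to larger subtrees: a quartet's expected count multiplies the marginal probabilities of its two constituent doublets by the total number of quartets.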

resample_trees_sextets(all_trees_sorted, num_resamples=10000, replacement_bool=True, cell_fates='auto', calc_expected=True)

Performs resampling of trees, drawing with or without replacement, returning the subtree dictionary and a DataFrame containing the number of sextets across all resamples, the original trees, and the expected number (solved analytically).

Resampling is done by (1) replacing each quartet with a randomly chosen quartet across all trees and (2) replacing every other doublet with a randomly chosen non-quartet doublet across all trees. If cell_fates is not explicitly provided, cell fates are determined automatically from the tree dataset.

Parameters:

  • all_trees_sorted (list, required): List where each entry is a string representing a tree in NEWICK format. Trees are sorted using the sort_align_tree function.
  • num_resamples (int, default 10000): Number of resample datasets.
  • replacement_bool (bool, default True): Sample cells with or without replacement, drawing from the pool of all cells.
  • cell_fates (string or list, default 'auto'): If 'auto' (i.e. not provided by the user), determined automatically from the tree dataset. The user can also provide a list where each entry is a string representing a cell fate.
  • calc_expected (bool, default True): If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

Returns:

tuple: Contains the following variables.

  • sextet_dict (dict): Keys are sextets, values are integers.
  • cell_fates (list): List where each entry is a string representing a cell fate.
  • dfs_c (DataFrame): Indexed by values from sextet_dict. The last column is the analytically solved expected number of each sextet. The second-to-last column is the observed number of occurrences in the original dataset. The remaining columns are the observed numbers of occurrences in the resampled sets.
Source code in linmo/resample.py
def resample_trees_sextets(all_trees_sorted, 
                            num_resamples=10000, 
                            replacement_bool=True,
                            cell_fates='auto', 
                            calc_expected=True
                           ):
    """Performs resampling of tree, drawing with or without replacement, returning subtree dictionary and DataFrame containing 
    number of sextets across all resamples, the original trees, and the expected number (solved analytically).

    Resampling is done via (1) replacing each quartet with a randomly chosen quartet across all trees and 
    (2) replacing every other doublet with a randomly chosen non-quartet doublet across all trees.
    If `cell_fates` not explicitly provided, use automatically determined cell fates based on tree dataset.

    Args:
        all_trees_sorted (list): List where each entry is a string representing a tree in NEWICK format. 
            Trees are sorted using the `sort_align_tree` function.
        num_resamples (int, optional): Number of resample datasets.
        replacement_bool (bool, optional): Sample cells with or without replacement drawing from the pool of all cells.
        cell_fates (string or list, optional): If 'auto' (i.e. not provided by user), automatically determined 
            based on tree dataset. User can also provide list where each entry is a string representing a cell fate.
        calc_expected (bool, optional): If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

    Returns:
        (tuple): Contains the following variables.
        - sextet_dict (dict): Keys are sextets, values are integers.
        - cell_fates (list): List where each entry is a string representing a cell fate.
        - dfs_c (DataFrame): Indexed by values from `sextet_dict`.
            Last column is analytically solved expected number of each sextet.
            Second to last column is observed number of occurrences in the original dataset.
            Rest of columns are the observed number of occurrences in the resampled sets.
    """
    # automatically determine cell fates if not explicitly provided
    if cell_fates == 'auto':
        cell_fates = sorted(list(np.unique(re.findall('[A-Z]', ''.join([i for sublist in all_trees_sorted for i in sublist])))))

    # _make_subtree_dict functions can only handle 10 cell fates max
    if len(cell_fates)>10:
        print('warning, _make_subtree_dict functions can only handle 10 cell fates max!')

    sextet_dict = _make_sextet_dict(cell_fates)
    quartet_dict = _make_quartet_dict(cell_fates)
    doublet_dict = _make_doublet_dict(cell_fates)

    # store result for each rearrangement in dfs list
    dfs_sextets_new = []
    df_sextets_true = _make_df_sextets(all_trees_sorted, sextet_dict, 'observed', False)
    df_quartets_true = _make_df_quartets(all_trees_sorted, quartet_dict, 'observed', False)
    df_doublets_non_quartets_true = _make_df_doublets_non_quartets(all_trees_sorted, doublet_dict, 'observed', False)

    # rearrange leaves num_resamples times
    for resample in tqdm(range(0, num_resamples)):
        quartets_true = _flatten_quartets(all_trees_sorted)
        doublets_non_quartets_true = _flatten_doublets_non_quartets(all_trees_sorted)

        # shuffle if replacement=False
        if replacement_bool==False:
            random.shuffle(quartets_true)
            random.shuffle(doublets_non_quartets_true)

        # first, replace each quartet with a symbol
        new_trees_1 = [_replace_quartets_symbol(i) for i in all_trees_sorted]
        # then, replace all other cells 
        new_trees_2 = [_replace_doublets(i, doublets_non_quartets_true, replacement_bool) for i in new_trees_1]
        # then, replace the symbols
        new_trees_3 = [_replace_symbols(i, quartets_true, replacement_bool) for i in new_trees_2]
        df_sextets_new = _make_df_sextets(new_trees_3, sextet_dict, resample, False)
        dfs_sextets_new.append(df_sextets_new)

    dfs_c = _process_dfs_sextet(df_sextets_true, dfs_sextets_new, num_resamples, sextet_dict, quartet_dict, doublet_dict, df_quartets_true, df_doublets_non_quartets_true, calc_expected)

    return (sextet_dict, cell_fates, dfs_c)

resample_trees_triplets(all_trees_sorted, num_resamples=10000, replacement_bool=True, cell_fates='auto', calc_expected=True)

Performs resampling of trees, drawing with or without replacement, returning the subtree dictionary and a DataFrame containing the number of triplets across all resamples, the original trees, and the expected number (solved analytically).

Resampling is done by (1) replacing each cell with a randomly chosen non-doublet cell across all trees and (2) replacing each doublet with a randomly chosen doublet across all trees. If cell_fates is not explicitly provided, cell fates are determined automatically from the tree dataset.

Parameters:

  • all_trees_sorted (list, required): List where each entry is a string representing a tree in NEWICK format. Trees are sorted using the sort_align_tree function.
  • num_resamples (int, default 10000): Number of resample datasets.
  • replacement_bool (bool, default True): Sample cells with or without replacement, drawing from the pool of all cells.
  • cell_fates (string or list, default 'auto'): If 'auto' (i.e. not provided by the user), determined automatically from the tree dataset. The user can also provide a list where each entry is a string representing a cell fate.
  • calc_expected (bool, default True): If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

Returns:

tuple: Contains the following variables.

  • triplet_dict (dict): Keys are triplets, values are integers.
  • cell_fates (list): List where each entry is a string representing a cell fate.
  • dfs_c (DataFrame): Indexed by values from triplet_dict. The last column is the analytically solved expected number of each triplet. The second-to-last column is the observed number of occurrences in the original dataset. The remaining columns are the observed numbers of occurrences in the resampled sets.
Source code in linmo/resample.py
def resample_trees_triplets(all_trees_sorted, 
                            num_resamples=10000, 
                            replacement_bool=True,
                            cell_fates='auto', 
                            calc_expected=True
                           ):
    """Performs resampling of tree, drawing with or without replacement, returning subtree dictionary and DataFrame containing 
    number of triplets across all resamples, the original trees, and the expected number (solved analytically).

    Resampling is done via (1) replacing each cell with a randomly chosen non-doublet cell across all trees and 
    (2) replacing each doublet with a randomly chosen doublet across all trees.
    If `cell_fates` not explicitly provided, use automatically determined cell fates based on tree dataset.

    Args:
        all_trees_sorted (list): List where each entry is a string representing a tree in NEWICK format. 
            Trees are sorted using the `sort_align_tree` function.
        num_resamples (int, optional): Number of resample datasets.
        replacement_bool (bool, optional): Sample cells with or without replacement drawing from the pool of all cells.
        cell_fates (string or list, optional): If 'auto' (i.e. not provided by user), automatically determined 
            based on tree dataset. User can also provide list where each entry is a string representing a cell fate.
        calc_expected (bool, optional): If True, calculate the expected count by multiplying the marginal probabilities of each sub-pattern by the total number of subtrees.

    Returns:
        (tuple): Contains the following variables.
        - triplet_dict (dict): Keys are triplets, values are integers.
        - cell_fates (list): List where each entry is a string representing a cell fate.
        - dfs_c (DataFrame): Indexed by values from `triplet_dict`.
            Last column is analytically solved expected number of each triplet.
            Second to last column is observed number of occurrences in the original dataset.
            Rest of columns are the observed number of occurrences in the resampled sets.
    """
    # automatically determine cell fates if not explicitly provided
    if cell_fates == 'auto':
        cell_fates = sorted(list(np.unique(re.findall('[A-Z]', ''.join([i for sublist in all_trees_sorted for i in sublist])))))

    # _make_subtree_dict functions can only handle 10 cell fates max
    if len(cell_fates)>10:
        print('warning, _make_subtree_dict functions can only handle 10 cell fates max!')

    triplet_dict = _make_triplet_dict(cell_fates)
    doublet_dict = _make_doublet_dict(cell_fates)
    cell_dict = _make_cell_dict(cell_fates)

    # store result for each rearrangement in dfs list
    dfs_triplets_new = []
    df_triplets_true = _make_df_triplets(all_trees_sorted, triplet_dict, 'observed', False)
    df_doublets_true = _make_df_doublets(all_trees_sorted, doublet_dict, 'observed', False)
    df_non_doublets_true = _make_df_non_doublets(all_trees_sorted, cell_dict, 'observed', False)

    # rearrange leaves num_resamples times
    for resample in tqdm(range(0, num_resamples)):
        doublets_true = _flatten_doublets(all_trees_sorted)
        non_doublets_true = _flatten_non_doublets(all_trees_sorted)

        # shuffle if replacement=False
        if replacement_bool==False:
            random.shuffle(doublets_true)
            random.shuffle(non_doublets_true)

        # first, replace the doublet with a symbol
        new_trees_1 = [_replace_doublets_symbol(i) for i in all_trees_sorted]
        # then, replace all other cells 
        new_trees_2 = [_replace_all(i, non_doublets_true, replacement_bool) for i in new_trees_1]
        # then, replace the symbols
        new_trees_3 = [_replace_symbols(i, doublets_true, replacement_bool) for i in new_trees_2]
        df_triplets_new = _make_df_triplets(new_trees_3, triplet_dict, resample, False)
        dfs_triplets_new.append(df_triplets_new)

    dfs_c = _process_dfs_triplet(df_triplets_true, dfs_triplets_new, num_resamples, triplet_dict, doublet_dict, cell_dict, df_doublets_true, df_non_doublets_true, calc_expected)

    return (triplet_dict, cell_fates, dfs_c)
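The three replacement steps in the loop above (mask each doublet behind a symbol, resample the remaining single cells, then resample the doublets back into the placeholders) can be sketched in plain Python. This is a toy illustration of the two-stage scheme, not linmo's private helpers, and it assumes single-uppercase-letter fates:

```python
import random
import re

# Toy two-stage triplet resampling: protect doublets, resample singles,
# then resample doublets into the placeholders.
trees = ["(A,(B,C))", "(D,(A,B))"]

doublet_pool = re.findall(r'\([A-Z],[A-Z]\)', ''.join(trees))
masked = [re.sub(r'\([A-Z],[A-Z]\)', '#', t) for t in trees]
single_pool = re.findall('[A-Z]', ''.join(masked))

random.seed(1)
# resample singles and doublets independently, with replacement
step2 = [re.sub('[A-Z]', lambda m: random.choice(single_pool), t) for t in masked]
resampled = [re.sub('#', lambda m: random.choice(doublet_pool), t) for t in step2]
```

Masking first matters: without it, a doublet's leaves would be scattered into the single-cell pool and the doublet-level structure would not be preserved as a unit.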

sort_align_tree(tree)

Sort and align the provided tree.

Parameters:

  • tree (string, required): Tree in NEWICK format.

Returns:

  • tree (string): Tree in NEWICK format. Trees are sorted to have all asymmetric septets in (x,(x,(x,(x,(x,(x,x)))))) format, asymmetric sextets in (x,(x,(x,(x,(x,x))))) format, asymmetric quintets in (x,(x,(x,(x,x)))) format, asymmetric quartets in (x,(x,(x,x))) format, triplets in (x,(x,x)) format, and all octets/quartets/doublets in alphabetical order.

Source code in linmo/resample.py
def sort_align_tree(tree):
    """Sort and align provided tree. 

    Args:
        tree (string): Tree in NEWICK format.

    Returns:
        tree (string): Tree in NEWICK format.
            Trees are sorted to have all asymmetric septets in (x,(x,(x,(x,(x,(x,x)))))) format, asymmetric sextets in (x,(x,(x,(x,(x,x))))) format, 
            asymmetric quintets in (x,(x,(x,(x,x)))) format, asymmetric quartets in (x,(x,(x,x))) format, triplets in (x,(x,x)) format,
            and all octets/quartets/doublets in alphabetical order.
    """
    tree = _align_asym_septet(_align_asym_sextet(_align_asym_quintet(_align_asym_quartet(_align_sextet(_sorted_octets(_sorted_quartets(_sorted_doublets(_align_triplets(tree)))))))))
    return tree
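For intuition, the alphabetical doublet-sorting step can be sketched with a regular expression that alphabetizes each leaf-level pair. `sort_doublets` below is a hypothetical stand-in for the internal `_sorted_doublets` helper, under the assumption that leaves are alphanumeric labels; it is not the library implementation.

```python
import re

def sort_doublets(tree):
    """Sort each leaf-level doublet (x,y) in a NEWICK string alphabetically."""
    def _alphabetize(match):
        a, b = sorted((match.group(1), match.group(2)))
        return f"({a},{b})"
    # match only pairs of plain leaf labels, not nested subtrees
    return re.sub(r"\(([A-Za-z0-9]+),([A-Za-z0-9]+)\)", _alphabetize, tree)

# e.g. "((B,A),(C,(D,B)))" becomes "((A,B),(C,(B,D)))"
```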

linmo.plot

Provides functions for visualizing motif analysis.

This module contains the following functions:

  • dfs_for_plotting - Takes DataFrame from resample_trees functions and returns DataFrame for plotting.
  • make_cell_color_dict - Returns cell color dictionary based on provided cell fates.
  • plot_frequency - Displays frequency plot of cutoff number of subtrees in original dataset and all resamples.
  • plot_deviation - Displays deviation plot of cutoff number of subtrees in original dataset and a subset of resamples.
  • multi_dataset_dfs_for_plotting - Takes DataFrame from multi_dataset_resample_trees function and returns DataFrames for plotting.
  • multi_dataset_plot_deviation - Displays deviation plot of cutoff number of subtrees in multiple datasets.

dfs_for_plotting(dfs_c, num_resamples, subtree_dict, cutoff='auto', num_null=1000, use_expected=True, min_cell_types=1)

Converts DataFrame from resample_trees functions into DataFrames for plotting.

Calculates z-scores by comparing the observed count in the original trees to the mean/std across all resamples. Calculates null z-scores by comparing the observed count of num_null random resamples to the mean/std across the rest of the resamples.

Parameters:

Name Type Description Default
dfs_c DataFrame

Indexed by values from subtree_dict. Last column is analytically solved expected count of each subtree. Second to last column is observed count of occurrences in the original dataset. Rest of columns are the observed count of occurrences in the resampled sets. Output from resample_trees functions.

required
num_resamples int

Number of resamples.

required
subtree_dict dict

Keys are subtrees, values are integers.

required
cutoff string or NoneType or int

Take cutoff number of subtrees with largest absolute z-scores to include in plots. If not provided explicitly, will be automatically determined to take all subtrees with abs z-score > 1. If NoneType, take all subtrees.

'auto'
num_null int

Take num_null number of resamples to calculate z-scores as part of null distribution.

1000
use_expected Boolean

Whether to use the analytically solved expected count column in the DataFrame.

True
min_cell_types int

Only include subtrees with at least this many different cell types.

1

Returns:

Type Description
tuple

Contains the following variables.

  • subtree_dict (dict): Keys are subtrees, values are integers. Rebuilt using min_cell_types (excludes subtrees with fewer than min_cell_types distinct cell types).
  • df_true_melt_subset (DataFrame): DataFrame indexed by cutoff number of most significant subtrees for plotting. Sorted by z-score from most over-represented to most under-represented. Contains the following columns:
    • subtree_val (int): Value corresponding to subtree_dict.
    • observed (float): Count in original trees.
    • expected (float): Analytically solved expected count. Only included if use_expected is True.
    • z-score (float): Computed using observed values and mean/std across resamples.
    • abs z-score (float): Absolute value of z-score.
    • label (string): Key corresponding to subtree_dict.
    • null min (float): Minimum count across all resamples.
    • null mean (float): Average count across all resamples.
    • null max (float): Maximum count across all resamples.
    • p_val (float): p-value, one-sided test, not corrected for multiple hypotheses testing.
    • adj_p_val_fdr_bh (float): adjusted p-value, corrected using the Benjamini and Hochberg FDR correction. Automatically set to 1 if min_cell_types > 1.
    • adj_p_val_fdr_tsbh (float): adjusted p-value, corrected using the Benjamini and Hochberg FDR correction with two stage linear step-up procedure. Automatically set to 1 if min_cell_types > 1.
    • null z-score min (float): Minimum z-score across num_null random resamples.
    • null z-score mean (float): Average z-score across num_null random resamples.
    • null z-score max (float): Maximum z-score across num_null random resamples.
  • df_melt_subset (DataFrame): Melted DataFrame with observed count for cutoff number of most significant subtrees across all resamples. Contains the following columns:
    • subtree_val (int): Value corresponding to subtree_dict.
    • observed (int): Counts across all resamples.
    • label (string): Key corresponding to subtree_dict.
  • df_melt_100resamples_subset (DataFrame): Melted DataFrame with observed count for cutoff number of most significant subtrees across 100 random resamples. Contains the following columns:
    • subtree_val (int): Value corresponding to subtree_dict.
    • observed (int): Counts across 100 random resamples.
    • label (string): Key corresponding to subtree_dict.
  • df_null_zscores_i_c_melt_subset (DataFrame): Melted DataFrame with null z-score for cutoff number of most significant subtrees across num_null random resamples. Contains the following columns:
    • subtree_val (int): Value corresponding to subtree_dict.
    • observed (float): Z-scores across num_null random resamples.
    • label (string): Key corresponding to subtree_dict.
  • df_null_zscores_i_c_melt_100resamples_subset (DataFrame): Melted DataFrame with null z-score for cutoff number of most significant subtrees across 100 random resamples. Contains the following columns:
    • subtree_val (int): Value corresponding to subtree_dict.
    • observed (float): Z-scores across 100 random resamples.
    • label (string): Key corresponding to subtree_dict.
Source code in linmo/plot.py
def dfs_for_plotting(dfs_c, num_resamples, subtree_dict, cutoff='auto', num_null=1000, use_expected=True, min_cell_types=1):
    """Converts DataFrame from resample_trees functions into DataFrames for plotting.

    Calculates z-scores by comparing the observed count in the original trees to the mean/std across all resamples.
    Calculates null z-scores by comparing the observed count of `num_null` random resamples to the mean/std across the rest of 
    the resamples.

    Args:
        dfs_c (DataFrame): Indexed by values from `subtree_dict`.
            Last column is analytically solved expected count of each subtree.
            Second to last column is observed count of occurrences in the original dataset.
            Rest of columns are the observed count of occurrences in the resampled sets.
            Output from resample_trees functions.
        num_resamples (int): Number of resamples.
        subtree_dict (dict): Keys are subtrees, values are integers.
        cutoff (string or NoneType or int, optional): Take `cutoff` number of subtrees with largest absolute z-scores 
            to include in plots.
            If not provided explicitly, will be automatically determined to take all subtrees with abs z-score > 1.
            If NoneType, take all subtrees.
        num_null (int, optional): Take `num_null` number of resamples to calculate z-scores as part of null distribution.
        use_expected (Boolean, optional): Whether to use the analytically solved expected count column in the DataFrame.
        min_cell_types (int, optional): Only include subtrees with at least this many different cell types.

    Returns:
        (tuple): Contains the following variables.

        - subtree_dict (dict): Keys are subtrees, values are integers. Rebuilt using min_cell_types (excludes subtrees with fewer than min_cell_types distinct cell types).
        - df_true_melt_subset (DataFrame): DataFrame indexed by `cutoff` number of most significant subtrees for plotting.
            Sorted by z-score from most over-represented to most under-represented. Contains the following columns: 
                - subtree_val (int): Value corresponding to `subtree_dict`.
                - observed (float): Count in original trees.
                - expected (float): Analytically solved expected count. Only included if use_expected is True.
                - z-score (float): Computed using observed values and mean/std across resamples.
                - abs z-score (float): Absolute value of z-score.
                - label (string): Key corresponding to `subtree_dict`.
                - null min (float): Minimum count across all resamples.
                - null mean (float): Average count across all resamples.
                - null max (float): Maximum count across all resamples.
                - p_val (float): p-value, one-sided test, not corrected for multiple hypotheses testing. 
                - adj_p_val_fdr_bh (float): adjusted p-value, corrected using the Benjamini and Hochberg FDR correction. Automatically set to 1 if min_cell_types > 1.
                - adj_p_val_fdr_tsbh (float): adjusted p-value, corrected using the Benjamini and Hochberg FDR correction with two stage linear step-up procedure. Automatically set to 1 if min_cell_types > 1.
                - null z-score min (float): Minimum z-score across `num_null` random resamples.
                - null z-score mean (float): Average z-score across `num_null` random resamples.
                - null z-score max (float): Maximum z-score across `num_null` random resamples.
        - df_melt_subset (DataFrame): Melted DataFrame with observed count for `cutoff` number of most significant subtrees 
            across all resamples. Contains the following columns:
                - subtree_val (int): Value corresponding to `subtree_dict`.
                - observed (int): Counts across all resamples.
                - label (string): Key corresponding to `subtree_dict`.
        - df_melt_100resamples_subset (DataFrame): Melted DataFrame with observed count for `cutoff` number of most significant
            subtrees across 100 random resamples. Contains the following columns:
                - subtree_val (int): Value corresponding to `subtree_dict`.
                - observed (int): Counts across 100 random resamples.
                - label (string): Key corresponding to `subtree_dict`.
        - df_null_zscores_i_c_melt_subset (DataFrame): Melted DataFrame with null z-score for `cutoff` number of most significant
            subtrees across `num_null` random resamples. Contains the following columns:
                - subtree_val (int): Value corresponding to `subtree_dict`.
                - observed (float): Z-scores across `num_null` random resamples.
                - label (string): Key corresponding to `subtree_dict`.
        - df_null_zscores_i_c_melt_100resamples_subset (DataFrame): Melted DataFrame with null z-score for `cutoff` number of 
            most significant subtrees across 100 random resamples. Contains the following columns:
                - subtree_val (int): Value corresponding to `subtree_dict`.
                - observed (float): Z-scores across 100 random resamples.
                - label (string): Key corresponding to `subtree_dict`.
    """

    # remake subtree_dict based on min_cell_types
    subtree_ss = []
    for i in subtree_dict.items():
        cell_types = set(re.findall("[A-Za-z0-9]+", i[0]))
        if len(cell_types) >= min_cell_types:
            subtree_ss.append(i)

    subtree_dict = {}
    for i, j in enumerate(subtree_ss):
        subtree_dict[j[0]] = i

    # subset dfs_c by subtree_dict
    dfs_c = dfs_c.loc[[i[1] for i in subtree_ss]].reset_index(drop=True)

    # slice out the subtrees of the original trees
    df_true_slice = dfs_c.loc[:,'observed']

    # dataframe of original trees
    data = {'subtree_val': df_true_slice.index,
            'observed': df_true_slice.values}
    df_true_melt = pd.DataFrame(data)

    # slice out the expected counts
    if use_expected:
        expected = dfs_c.loc[:,'expected'].values

    # dataframe of resampled trees
    resamples = num_resamples - 1
    df_melt = pd.melt(dfs_c.loc[:,'0':f'{resamples}'].transpose(), var_name='subtree_val', value_name='observed')
    df_melt_100resamples = pd.melt(dfs_c.loc[:,'0':'99'].transpose(), var_name='subtree_val', value_name='observed')

    # calculate zscores
    zscores = []
    for i in tqdm(df_true_slice.index):
        actual = df_true_slice[i]
        mean = np.mean(df_melt.loc[df_melt['subtree_val']==i]['observed'].values)
        std = np.std(df_melt.loc[df_melt['subtree_val']==i]['observed'].values)
        if std == 0:
            zscore = 0
        else:
            zscore = (actual - mean) / std
        zscores.append(zscore)

    # assign z-scores to dataframe and sort by absolute z-score
    if use_expected:
        df_true_melt['expected'] = expected
    df_true_melt['z-score'] = zscores
    df_true_melt['abs z-score'] = abs(df_true_melt['z-score'])
    df_true_melt.fillna(0, inplace=True)
    df_true_melt.sort_values('abs z-score', axis=0, ascending=False, inplace=True)

    # subset based on the number of subtrees
    if cutoff == 'auto':
        cutoff = (df_true_melt['abs z-score'].values>1).sum()
        df_true_melt_subset = df_true_melt.iloc[:cutoff].copy()
    elif cutoff is None:
        df_true_melt_subset = df_true_melt
    else:
        df_true_melt_subset = df_true_melt.iloc[:cutoff].copy()

    df_true_melt_subset.sort_values('z-score', axis=0, ascending=False, inplace=True)
    df_true_melt_subset['label'] = [list(subtree_dict.keys())[i] for i in df_true_melt_subset['subtree_val'].values]

    # exit early if all z-scores are 0
    if (df_true_melt_subset['z-score'] == 0).all():
        return (subtree_dict, df_true_melt_subset, False, False, False, False)

    # subset the resamples
    df_melt_subset_list = []
    for i in df_true_melt_subset['subtree_val']:
        df_melt_subtree = df_melt.loc[df_melt['subtree_val']==i].copy()
        df_melt_subtree['label']=list(subtree_dict.keys())[i]
        df_melt_subset_list.append(df_melt_subtree)
    df_melt_subset = pd.concat(df_melt_subset_list)

    df_melt_100resamples_subset_list = []
    for i in df_true_melt_subset['subtree_val']:
        df_melt_100resamples_subtree = df_melt_100resamples.loc[df_melt_100resamples['subtree_val']==i].copy()
        df_melt_100resamples_subtree['label']=list(subtree_dict.keys())[i]
        df_melt_100resamples_subset_list.append(df_melt_100resamples_subtree)
    df_melt_100resamples_subset = pd.concat(df_melt_100resamples_subset_list)

    df_true_melt_subset['null min'] = [df_melt_subset.groupby(['subtree_val']).min(numeric_only=True).loc[i].values[0] for i in df_true_melt_subset['subtree_val']]
    df_true_melt_subset['null mean'] = [df_melt_subset.groupby(['subtree_val']).mean(numeric_only=True).loc[i].values[0] for i in df_true_melt_subset['subtree_val']]
    df_true_melt_subset['null max'] = [df_melt_subset.groupby(['subtree_val']).max(numeric_only=True).loc[i].values[0] for i in df_true_melt_subset['subtree_val']]

    # calculate p-value (one-sided test)
    p_val_list = []
    for i, j in zip(df_true_melt_subset['subtree_val'].values, df_true_melt_subset['z-score'].values):
        resamples = dfs_c.iloc[i].values[:-1]
        actual = df_true_melt_subset.loc[df_true_melt_subset['subtree_val']==i]['observed'].values[0]
        if j > 0:
            pos = sum(resamples>=actual)
        elif j < 0:
            pos = sum(resamples<=actual)
        elif j == 0:
            pos=len(resamples)

        p_val = pos/len(resamples)
        p_val_list.append(p_val)

    df_true_melt_subset['p_val'] = p_val_list
    if min_cell_types == 1:
        df_true_melt_subset['adj_p_val_fdr_bh'] = multipletests(p_val_list, method='fdr_bh')[1]
        df_true_melt_subset['adj_p_val_fdr_tsbh'] = multipletests(p_val_list, method='fdr_tsbh')[1]
    elif min_cell_types > 1:
        df_true_melt_subset['adj_p_val_fdr_bh'] = 1
        df_true_melt_subset['adj_p_val_fdr_tsbh'] = 1

    # calculate deviation of each resample
    df_null_zscores_i_list = []
    for i in tqdm(range(num_null)):
        df_true_slice_i = dfs_c[f'{i}'].copy()
        data = {'subtree_val': df_true_slice_i.index,
                'observed': df_true_slice_i.values}
        df_true_melt_i = pd.DataFrame(data)

        if use_expected:
            df_subset_i = dfs_c[dfs_c.columns[~dfs_c.columns.isin([f'{i}','observed', 'expected'])]].copy()
        else:
            df_subset_i = dfs_c[dfs_c.columns[~dfs_c.columns.isin([f'{i}','observed'])]].copy()
        df_melt_i = pd.melt(df_subset_i.transpose(), var_name='subtree_val', value_name='observed')

        zscores_i = []
        for j in df_true_slice_i.index:
            actual = df_true_slice_i[j]
            mean = np.mean(df_melt_i.loc[df_melt_i['subtree_val']==j]['observed'].values)
            std = np.std(df_melt_i.loc[df_melt_i['subtree_val']==j]['observed'].values)
            if std == 0:
                zscore = 0
            else:
                zscore = (actual - mean) / std
            zscores_i.append(zscore)

        df_null_zscores_i = pd.DataFrame(zscores_i, columns=[i])
        df_null_zscores_i_list.append(df_null_zscores_i)

    df_null_zscores_i_c = pd.concat(df_null_zscores_i_list, axis=1)
    df_null_zscores_i_c.fillna(0, inplace=True)

    df_null_zscores_i_c_melt = df_null_zscores_i_c.transpose().melt(var_name='subtree_val', value_name='observed')
    df_null_zscores_i_c_melt_100resamples = df_null_zscores_i_c.loc[:,:99].transpose().melt(var_name='subtree_val', value_name='observed')

    # subset the resamples
    df_null_zscores_i_c_melt_subset_list = []
    for i in df_true_melt_subset['subtree_val']:
        df_null_zscores_i_c_melt_subtree = df_null_zscores_i_c_melt.loc[df_null_zscores_i_c_melt['subtree_val']==i].copy()
        df_null_zscores_i_c_melt_subtree['label']=list(subtree_dict.keys())[i]
        df_null_zscores_i_c_melt_subset_list.append(df_null_zscores_i_c_melt_subtree)
    df_null_zscores_i_c_melt_subset = pd.concat(df_null_zscores_i_c_melt_subset_list)

    # subset the resamples
    df_null_zscores_i_c_melt_100resamples_subset_list = []
    for i in df_true_melt_subset['subtree_val']:
        df_null_zscores_i_c_melt_100resamples_subtree = df_null_zscores_i_c_melt_100resamples.loc[df_null_zscores_i_c_melt_100resamples['subtree_val']==i].copy()
        df_null_zscores_i_c_melt_100resamples_subtree['label']=list(subtree_dict.keys())[i]
        df_null_zscores_i_c_melt_100resamples_subset_list.append(df_null_zscores_i_c_melt_100resamples_subtree)
    df_null_zscores_i_c_melt_100resamples_subset = pd.concat(df_null_zscores_i_c_melt_100resamples_subset_list)

    df_true_melt_subset['null z-score min'] = [df_null_zscores_i_c_melt_subset.groupby(['subtree_val']).min(numeric_only=True).loc[i].values[0] for i in df_true_melt_subset['subtree_val']]
    df_true_melt_subset['null z-score mean'] = [df_null_zscores_i_c_melt_subset.groupby(['subtree_val']).mean(numeric_only=True).loc[i].values[0] for i in df_true_melt_subset['subtree_val']]
    df_true_melt_subset['null z-score max'] = [df_null_zscores_i_c_melt_subset.groupby(['subtree_val']).max(numeric_only=True).loc[i].values[0] for i in df_true_melt_subset['subtree_val']]

    return (subtree_dict, df_true_melt_subset, df_melt_subset, df_melt_100resamples_subset, df_null_zscores_i_c_melt_subset, df_null_zscores_i_c_melt_100resamples_subset)
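The per-subtree statistics computed above reduce to a z-score against the resample distribution and an empirical one-sided p-value. The pure-Python sketch below (a hypothetical `zscore_and_pval` helper, standing in for the numpy/pandas code) captures the logic, including the std == 0 guard:

```python
from statistics import fmean, pstdev

def zscore_and_pval(observed, resamples):
    """Z-score of the observed count against the resample counts, and a
    one-sided empirical p-value (fraction of resamples at least as extreme)."""
    mean, std = fmean(resamples), pstdev(resamples)
    z = 0.0 if std == 0 else (observed - mean) / std
    if z > 0:
        p = sum(r >= observed for r in resamples) / len(resamples)
    elif z < 0:
        p = sum(r <= observed for r in resamples) / len(resamples)
    else:
        p = 1.0
    return z, p
```

As in `dfs_for_plotting`, the p-values here are uncorrected; the real function additionally applies Benjamini and Hochberg FDR correction via `multipletests`.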

make_color_dict(labels, colors)

Makes color dictionary based on provided labels (can be cell types or dataset names).

Parameters:

Name Type Description Default
labels list

List of string labels.

required
colors list

List of string color codes.

required

Returns:

Name Type Description
color_dict dict

Keys are labels, values are colors.

Source code in linmo/plot.py
def make_color_dict(labels, colors):
    """Makes color dictionary based on provided labels (can be cell types or dataset names).

    Args:
        labels (list): List of string labels.
        colors (list): List of string color codes.

    Returns:
        color_dict (dict): Keys are labels, values are colors.

    """
    color_dict = dict(zip(labels, colors))
    return color_dict
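Usage reduces to pairing each label with a color; the cell-type labels and hex codes below are illustrative, not taken from a real dataset.

```python
def make_color_dict(labels, colors):
    """As in linmo.plot: keys are labels, values are colors."""
    return dict(zip(labels, colors))

cell_color_dict = make_color_dict(["HSC", "MPP"], ["#1f77b4", "#ff7f0e"])
# cell_color_dict == {"HSC": "#1f77b4", "MPP": "#ff7f0e"}
```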

multi_dataset_dfs_for_plotting(dfs_dataset_c, dataset_names, num_resamples, subtree_dict, cutoff='auto', num_null=1000)

Converts DataFrame from multi_dataset_resample_trees function into DataFrames for plotting.

Calculates z-scores by comparing the observed count in the original trees to the mean/std across all resamples. Calculates null z-scores by comparing the observed count of num_null random resamples to the mean/std across the rest of the resamples.

Parameters:

Name Type Description Default
dfs_dataset_c list

List where each entry is a DataFrame with the following characteristics. Indexed by values from subtree_dict. Last column is dataset label. Second to last column is analytically solved expected count of each subtree. Third to last column is observed count of occurrences in the original dataset. Rest of columns are the observed count of occurrences in the resampled sets. Output from multi_dataset_resample_trees function.

required
dataset_names list

List where each entry is a string representing the dataset label.

required
num_resamples int

Number of resamples.

required
subtree_dict dict

Keys are subtrees, values are integers.

required
cutoff string or NoneType or int

Takes cutoff number of subtrees with largest absolute z-scores across all datasets to include in plots. If not provided explicitly, will be automatically determined to take all subtrees with abs z-score > 1 in at least one of the datasets provided. If NoneType, take all subtrees.

'auto'
num_null int

Takes num_null number of resamples to calculate z-scores as part of null distribution.

1000

Returns:

Type Description
tuple

Contains the following DataFrames.

  • df_true_melt_dataset_label_c_c (DataFrame): DataFrame indexed by cutoff number of most significant subtrees for plotting. Sorted by z-score from most over-represented to most under-represented (using the most extreme z-score for each subtree across all datasets provided). Contains the following columns:
    • subtree_val (int): Value corresponding to subtree_dict.
    • observed (float): Count in original trees.
    • expected (float): Analytically solved expected count.
    • z-score (float): Computed using observed values and mean/std across resamples.
    • abs z-score (float): Absolute value of z-score.
    • label (string): Key corresponding to subtree_dict.
    • null min (float): Minimum count across all resamples.
    • null mean (float): Average count across all resamples.
    • null max (float): Maximum count across all resamples.
    • p_val (float): p-value, one-sided test, not corrected for multiple hypotheses testing.
    • adj_p_val_fdr_bh (float): adjusted p-value, corrected using the Benjamini and Hochberg FDR correction.
    • adj_p_val_fdr_tsbh (float): adjusted p-value, corrected using the Benjamini and Hochberg FDR correction with two stage linear step-up procedure.
    • dataset (string): Dataset label.
    • null z-score min (float): Minimum z-score across num_null random resamples.
    • null z-score mean (float): Average z-score across num_null random resamples.
    • null z-score max (float): Maximum z-score across num_null random resamples.
  • df_melt_subset_c_c (DataFrame): Melted DataFrame with observed count for cutoff number of most significant subtrees across all resamples. Contains the following columns:
    • subtree_val (int): Value corresponding to subtree_dict.
    • observed (int): Counts across all resamples.
    • label (string): Key corresponding to subtree_dict.
    • dataset (string): Dataset label.
  • df_melt_100resamples_subset_c_c (DataFrame): Melted DataFrame with observed count for cutoff number of most significant subtrees across 100 random resamples. Contains the following columns:
    • subtree_val (int): Value corresponding to subtree_dict.
    • observed (int): Counts across 100 random resamples.
    • label (string): Key corresponding to subtree_dict.
    • dataset (string): Dataset label.
  • df_null_zscores_i_c_melt_subset_c_c (DataFrame): Melted DataFrame with null z-score for cutoff number of most significant subtrees across num_null random resamples. Contains the following columns:
    • subtree_val (int): Value corresponding to subtree_dict.
    • observed (float): Z-scores across num_null random resamples.
    • label (string): Key corresponding to subtree_dict.
    • dataset (string): Dataset label.
  • df_null_zscores_i_c_melt_100resamples_subset_c_c (DataFrame): Melted DataFrame with null z-score for cutoff number of most significant subtrees across 100 random resamples. Contains the following columns:
    • subtree_val (int): Value corresponding to subtree_dict.
    • observed (float): Z-scores across 100 random resamples.
    • label (string): Key corresponding to subtree_dict.
    • dataset (string): Dataset label.
Source code in linmo/plot.py
def multi_dataset_dfs_for_plotting(dfs_dataset_c, 
                                   dataset_names, 
                                   num_resamples, 
                                   subtree_dict, 
                                   cutoff='auto', 
                                   num_null=1000):
    """Converts DataFrame from `multi_dataset_resample_trees` function into DataFrames for plotting.

    Calculates z-scores by comparing the observed count in the original trees to the mean/std across all resamples.
    Calculates null z-scores by comparing the observed count of `num_null` random resamples to the mean/std across the rest of 
    the resamples.

    Args:
        dfs_dataset_c (list): List where each entry is a DataFrame with the following characteristics.
            Indexed by values from `subtree_dict`.
            Last column is dataset label.
            Second to last column is analytically solved expected count of each subtree.
            Third to last column is observed count of occurrences in the original dataset.
            Rest of columns are the observed count of occurrences in the resampled sets.
            Output from `multi_dataset_resample_trees` function.
        dataset_names (list): List where each entry is a string representing the dataset label. 
        num_resamples (int): Number of resamples.
        subtree_dict (dict): Keys are subtrees, values are integers.
        cutoff (string or NoneType or int, optional): Takes `cutoff` number of subtrees with largest absolute z-scores 
            across all datasets to include in plots.
            If not provided explicitly, will be automatically determined to take all subtrees with abs z-score > 1
                in at least one of the datasets provided.
            If NoneType, take all subtrees.
        num_null (int, optional): Takes `num_null` number of resamples to calculate z-scores as part of null distribution.

    Returns:
        (tuple): Contains the following DataFrames.

        - df_true_melt_dataset_label_c_c (DataFrame): DataFrame indexed by `cutoff` number of most significant subtrees for plotting.
            Sorted by z-score from most over-represented to most under-represented (using the most extreme z-score
            for each subtree across all datasets provided). Contains the following columns:
                - subtree_val (int): Value corresponding to `subtree_dict`.
                - observed (float): Count in original trees.
                - expected (float): Analytically solved expected count.
                - z-score (float): Computed using observed values and mean/std across resamples.
                - abs z-score (float): Absolute value of z-score.
                - label (string): Key corresponding to `subtree_dict`.
                - null min (float): Minimum count across all resamples.
                - null mean (float): Average count across all resamples.
                - null max (float): Maximum count across all resamples.
                - p_val (float): p-value, one-sided test, not corrected for multiple hypotheses testing.
                - adj_p_val_fdr_bh (float): adjusted p-value, corrected using the Benjamini and Hochberg FDR correction.
                - adj_p_val_fdr_tsbh (float): adjusted p-value, corrected using the Benjamini and Hochberg FDR correction with two stage linear step-up procedure.
                - dataset (string): Dataset label.
                - null z-score min (float): Minimum z-score across `num_null` random resamples.
                - null z-score mean (float): Average z-score across `num_null` random resamples.
                - null z-score max (float): Maximum z-score across `num_null` random resamples.
        - df_melt_subset_c_c (DataFrame): Melted DataFrame with observed count for `cutoff` number of most significant subtrees 
            across all resamples. Contains the following columns:
                - subtree_val (int): Value corresponding to `subtree_dict`.
                - observed (int): Counts across all resamples.
                - label (string): Key corresponding to `subtree_dict`.
                - dataset (string): Dataset label.
        - df_melt_100resamples_subset_c_c (DataFrame): Melted DataFrame with observed count for `cutoff` number of most significant
            subtrees across 100 random resamples. Contains the following columns:
                - subtree_val (int): Value corresponding to `subtree_dict`.
                - observed (int): Counts across 100 random resamples.
                - label (string): Key corresponding to `subtree_dict`.
                - dataset (string): Dataset label.
        - df_null_zscores_i_c_melt_subset_c_c (DataFrame): Melted DataFrame with null z-score for `cutoff` number of most significant
            subtrees across `num_null` random resamples. Contains the following columns:
                - subtree_val (int): Value corresponding to `subtree_dict`.
                - observed (float): Z-scores across `num_null` random resamples.
                - label (string): Key corresponding to `subtree_dict`.
                - dataset (string): Dataset label.
        - df_null_zscores_i_c_melt_100resamples_subset_c_c (DataFrame): Melted DataFrame with null z-score for `cutoff` number of 
            most significant subtrees across 100 random resamples. Contains the following columns:
                - subtree_val (int): Value corresponding to `subtree_dict`.
                - observed (float): Z-scores across 100 random resamples.
                - label (string): Key corresponding to `subtree_dict`.
                - dataset (string): Dataset label.
    """
    df_melt_list = []
    df_melt_100resamples_list = []
    df_true_melt_list = []
    df_null_zscores_i_c_melt_list = []
    df_null_zscores_i_c_melt_100resamples_list = []

    for index, dataset_name in enumerate(dataset_names):

        dfs_c = dfs_dataset_c.loc[dfs_dataset_c['dataset']==dataset_name]

        # slice out the subtrees of the original trees
        df_true_slice = dfs_c.loc[:,'observed']

        # dataframe of original trees
        data = {'subtree_val': df_true_slice.index,
                'observed': df_true_slice.values}
        df_true_melt = pd.DataFrame(data)

        # slice out the analytically expected subtree counts
        expected = dfs_c.loc[:,'expected'].values

        # dataframe of resampled trees
        resamples = num_resamples - 1
        df_melt = pd.melt(dfs_c.loc[:,'0':f'{resamples}'].transpose(), var_name='subtree_val', value_name='observed')
        df_melt_100resamples = pd.melt(dfs_c.loc[:,'0':'99'].transpose(), var_name='subtree_val', value_name='observed')

        df_melt_list.append(df_melt)
        df_melt_100resamples_list.append(df_melt_100resamples)

        # calculate zscores
        zscores = []
        for i in df_true_slice.index:
            actual = df_true_slice[i]
            mean = np.mean(df_melt.loc[df_melt['subtree_val']==i]['observed'].values)
            std = np.std(df_melt.loc[df_melt['subtree_val']==i]['observed'].values)
            if std == 0:
                zscore = 0
            else:
                zscore = (actual - mean) / std
            zscores.append(zscore)

        # assign to dataframe and sort subtrees by absolute z-score
        df_true_melt['expected'] = expected
        df_true_melt['z-score'] = zscores
        df_true_melt['abs z-score'] = abs(df_true_melt['z-score'])
        df_true_melt.fillna(0, inplace=True)
        df_true_melt.sort_values('abs z-score', axis=0, ascending=False, inplace=True)
        df_true_melt['label'] = [list(subtree_dict.keys())[i] for i in df_true_melt['subtree_val'].values]
        df_true_melt['null min'] = [df_melt.groupby(['subtree_val']).min(numeric_only=True).loc[i].values[0] for i in df_true_melt['subtree_val']]
        df_true_melt['null mean'] = [df_melt.groupby(['subtree_val']).mean(numeric_only=True).loc[i].values[0] for i in df_true_melt['subtree_val']]
        df_true_melt['null max'] = [df_melt.groupby(['subtree_val']).max(numeric_only=True).loc[i].values[0] for i in df_true_melt['subtree_val']]

        # calculate p-value (one-sided test)
        p_val_list = []
        for i, j in zip(df_true_melt['subtree_val'].values, df_true_melt['z-score'].values):
            resamples = dfs_c.iloc[i].values[:-1]
            actual = df_true_melt.loc[df_true_melt['subtree_val']==i]['observed'].values[0]
            if j > 0:
                pos = sum(resamples>=actual)
            elif j < 0:
                pos = sum(resamples<=actual)
            elif j == 0:
                pos=len(resamples)
            p_val = pos/len(resamples)
            p_val_list.append(p_val)
        df_true_melt['p_val'] = p_val_list
        df_true_melt['adj_p_val_fdr_bh'] = multipletests(p_val_list, method='fdr_bh')[1]
        df_true_melt['adj_p_val_fdr_tsbh'] = multipletests(p_val_list, method='fdr_tsbh')[1]
        df_true_melt['dataset'] = dataset_names[index]
        df_true_melt_list.append(df_true_melt)

        # calculate null z-scores: deviation of each resample relative to the remaining resamples
        df_null_zscores_i_list = []
        for i in tqdm(range(num_null)):
            df_true_slice_i = dfs_c[f'{i}'].copy()
            data = {'subtree_val': df_true_slice_i.index,
                    'observed': df_true_slice_i.values}
            df_true_melt_i = pd.DataFrame(data)

            df_subset_i = dfs_c[dfs_c.columns[~dfs_c.columns.isin([f'{i}', 'observed', 'expected', 'dataset'])]].copy()
            df_melt_i = pd.melt(df_subset_i.transpose(), var_name='subtree_val', value_name='observed')

            zscores_i = []
            for j in df_true_slice_i.index:
                actual = df_true_slice_i[j]
                mean = np.mean(df_melt_i.loc[df_melt_i['subtree_val']==j]['observed'].values)
                std = np.std(df_melt_i.loc[df_melt_i['subtree_val']==j]['observed'].values)
                if std == 0:
                    zscore = 0
                else:
                    zscore = (actual - mean) / std
                zscores_i.append(zscore)

            df_null_zscores_i = pd.DataFrame(zscores_i, columns=[i])
            df_null_zscores_i_list.append(df_null_zscores_i)

        df_null_zscores_i_c = pd.concat(df_null_zscores_i_list, axis=1)
        df_null_zscores_i_c.fillna(0, inplace=True)

        df_null_zscores_i_c_melt = df_null_zscores_i_c.transpose().melt(var_name='subtree_val', value_name='observed')
        df_null_zscores_i_c_melt_100resamples = df_null_zscores_i_c.loc[:,:99].transpose().melt(var_name='subtree_val', value_name='observed')

        df_null_zscores_i_c_melt_list.append(df_null_zscores_i_c_melt)
        df_null_zscores_i_c_melt_100resamples_list.append(df_null_zscores_i_c_melt_100resamples)

        df_true_melt['null z-score min'] = [df_null_zscores_i_c_melt.groupby(['subtree_val']).min(numeric_only=True).loc[i].values[0] for i in df_true_melt['subtree_val']]
        df_true_melt['null z-score mean'] = [df_null_zscores_i_c_melt.groupby(['subtree_val']).mean(numeric_only=True).loc[i].values[0] for i in df_true_melt['subtree_val']]
        df_true_melt['null z-score max'] = [df_null_zscores_i_c_melt.groupby(['subtree_val']).max(numeric_only=True).loc[i].values[0] for i in df_true_melt['subtree_val']]

    df_true_melt_c = pd.concat(df_true_melt_list)

    # Loop through each subtree and take the highest absolute z-score across all datasets
    df_true_melt_c_label_list = []
    for i in subtree_dict.keys():
        df_true_melt_c_label = df_true_melt_c.loc[df_true_melt_c['label']==i].copy()
        if len(df_true_melt_c_label) == 0:
            continue
        df_true_melt_c_label.sort_values('abs z-score', axis=0, ascending=False, inplace=True)
        df_true_melt_c_label = df_true_melt_c_label.iloc[[0]].copy()
        df_true_melt_c_label_list.append(df_true_melt_c_label)

    df_true_melt_c_label_c = pd.concat(df_true_melt_c_label_list)
    df_true_melt_c_label_c.sort_values('abs z-score', axis=0, ascending=False, inplace=True)

    # Subset based on the cutoff number of subtrees
    if cutoff == 'auto':
        cutoff = (df_true_melt_c_label_c['abs z-score'].values>1).sum()
        df_true_melt_c_label_c_subset = df_true_melt_c_label_c.iloc[:cutoff].copy()
    elif cutoff is None:
        df_true_melt_c_label_c_subset = df_true_melt_c_label_c.copy()
    else:
        df_true_melt_c_label_c_subset = df_true_melt_c_label_c.iloc[:cutoff].copy()

    df_true_melt_c_label_c_subset.sort_values('z-score', axis=0, ascending=False, inplace=True)

    # Subset the z-score DataFrame based on the cutoff number of subtrees in the DataFrames for each dataset
    df_true_melt_dataset_label_c_list = []
    for dataset in dataset_names:
        df_true_melt_dataset = df_true_melt_c.loc[df_true_melt_c['dataset']==dataset]
        df_true_melt_dataset_label_list = []
        for i in df_true_melt_c_label_c_subset['subtree_val']:
            df_true_melt_dataset_label = df_true_melt_dataset.loc[df_true_melt_dataset['subtree_val']==i]
            df_true_melt_dataset_label_list.append(df_true_melt_dataset_label)
        df_true_melt_dataset_label_c = pd.concat(df_true_melt_dataset_label_list)
        df_true_melt_dataset_label_c_list.append(df_true_melt_dataset_label_c)
    df_true_melt_dataset_label_c_c = pd.concat(df_true_melt_dataset_label_c_list)

    # Subset the melted DataFrames based on the cutoff number of subtrees in the DataFrames for each dataset
    df_melt_subset_c_list = []
    df_melt_100resamples_subset_c_list = []
    df_null_zscores_i_c_melt_subset_c_list = []
    df_null_zscores_i_c_melt_100resamples_subset_c_list = []
    for index, (df_melt, 
                df_melt_100resamples, 
                df_null_zscores_i_c_melt, 
                df_null_zscores_i_c_melt_100resamples) in enumerate(zip(df_melt_list, 
                                                                        df_melt_100resamples_list,
                                                                        df_null_zscores_i_c_melt_list, 
                                                                        df_null_zscores_i_c_melt_100resamples_list)):
        df_melt_subset_list = []
        for i in df_true_melt_c_label_c_subset['subtree_val']:
            df_melt_subtree = df_melt.loc[df_melt['subtree_val']==i].copy()
            df_melt_subtree['label']=list(subtree_dict.keys())[i]
            df_melt_subset_list.append(df_melt_subtree)
        df_melt_subset_c = pd.concat(df_melt_subset_list)
        df_melt_subset_c['dataset'] = dataset_names[index]
        df_melt_subset_c_list.append(df_melt_subset_c)

        df_melt_100resamples_subset_list = []
        for i in df_true_melt_c_label_c_subset['subtree_val']:
            df_melt_100resamples_subtree = df_melt_100resamples.loc[df_melt_100resamples['subtree_val']==i].copy()
            df_melt_100resamples_subtree['label']=list(subtree_dict.keys())[i]
            df_melt_100resamples_subset_list.append(df_melt_100resamples_subtree)
        df_melt_100resamples_subset_c = pd.concat(df_melt_100resamples_subset_list)
        df_melt_100resamples_subset_c['dataset'] = dataset_names[index]
        df_melt_100resamples_subset_c_list.append(df_melt_100resamples_subset_c)

        df_null_zscores_i_c_melt_subset_list = []
        for i in df_true_melt_c_label_c_subset['subtree_val']:
            df_null_zscores_i_c_melt_subtree = df_null_zscores_i_c_melt.loc[df_null_zscores_i_c_melt['subtree_val']==i].copy()
            df_null_zscores_i_c_melt_subtree['label']=list(subtree_dict.keys())[i]
            df_null_zscores_i_c_melt_subset_list.append(df_null_zscores_i_c_melt_subtree)
        df_null_zscores_i_c_melt_subset_c = pd.concat(df_null_zscores_i_c_melt_subset_list)
        df_null_zscores_i_c_melt_subset_c['dataset'] = dataset_names[index]
        df_null_zscores_i_c_melt_subset_c_list.append(df_null_zscores_i_c_melt_subset_c)

        df_null_zscores_i_c_melt_100resamples_subset_list = []
        for i in df_true_melt_c_label_c_subset['subtree_val']:
            df_null_zscores_i_c_melt_100resamples_subtree = df_null_zscores_i_c_melt_100resamples.loc[df_null_zscores_i_c_melt_100resamples['subtree_val']==i].copy()
            df_null_zscores_i_c_melt_100resamples_subtree['label']=list(subtree_dict.keys())[i]
            df_null_zscores_i_c_melt_100resamples_subset_list.append(df_null_zscores_i_c_melt_100resamples_subtree)
        df_null_zscores_i_c_melt_100resamples_subset_c = pd.concat(df_null_zscores_i_c_melt_100resamples_subset_list)
        df_null_zscores_i_c_melt_100resamples_subset_c['dataset'] = dataset_names[index]
        df_null_zscores_i_c_melt_100resamples_subset_c_list.append(df_null_zscores_i_c_melt_100resamples_subset_c)

    df_melt_subset_c_c = pd.concat(df_melt_subset_c_list)
    df_melt_100resamples_subset_c_c = pd.concat(df_melt_100resamples_subset_c_list)
    df_null_zscores_i_c_melt_subset_c_c = pd.concat(df_null_zscores_i_c_melt_subset_c_list)
    df_null_zscores_i_c_melt_100resamples_subset_c_c = pd.concat(df_null_zscores_i_c_melt_100resamples_subset_c_list)

    return (df_true_melt_dataset_label_c_c,
            df_melt_subset_c_c, 
            df_melt_100resamples_subset_c_c,
            df_null_zscores_i_c_melt_subset_c_c,
            df_null_zscores_i_c_melt_100resamples_subset_c_c)
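The z-score and one-sided empirical p-value logic used above can be sketched in isolation. This is a minimal illustration with toy resample counts; the function name and inputs are illustrative, not part of linmo:

```python
import numpy as np

def resample_zscore_pval(observed, resamples):
    """Z-score of an observed subtree count against resampled counts,
    plus a one-sided empirical p-value, mirroring the loop above."""
    resamples = np.asarray(resamples, dtype=float)
    mean, std = resamples.mean(), resamples.std()
    zscore = 0.0 if std == 0 else (observed - mean) / std
    if zscore > 0:
        pos = (resamples >= observed).sum()   # over-represented: count resamples at least as extreme
    elif zscore < 0:
        pos = (resamples <= observed).sum()   # under-represented: count resamples at least as extreme
    else:
        pos = len(resamples)                  # no deviation: p-value of 1
    return zscore, pos / len(resamples)

z, p = resample_zscore_pval(12, [5, 6, 7, 6, 5, 7, 6, 5, 6, 7])
```

As in the source above, a zero standard deviation (a subtree count that never varies across resamples) is mapped to a z-score of 0 rather than a division error.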

multi_dataset_plot_deviation(subtree, dataset_names, df_true_melt_dataset_label_c_c, dataset_color_dict, cell_color_dict, cutoff='auto', title='auto', legend_bool=True, legend_pos='outside', save=False, image_format='png', dpi=300, image_save_path=None)

Plots deviation of `cutoff` number of subtrees in multiple datasets.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `subtree` | string | Type of subtree. | required |
| `dataset_names` | list | List where each entry is a string representing the dataset label. | required |
| `df_true_melt_dataset_label_c_c` | DataFrame | DataFrame with `cutoff` number of most significant subtrees for plotting. Sorted by z-score from most over-represented to most under-represented. Output from `multi_dataset_dfs_for_plotting` function. | required |
| `dataset_color_dict` | dict | Keys are dataset names, values are colors. | required |
| `cell_color_dict` | dict | Keys are cell fates, values are colors. | required |
| `cutoff` | string, NoneType, or int | Take `cutoff` number of subtrees with largest absolute z-scores to include in plots. If not provided explicitly, automatically determined to take all subtrees with abs z-score > 1. If None, take all subtrees. | `'auto'` |
| `title` | string | Title to use for plot. If not provided explicitly, automatically determined to read `subtree` frequency. | `'auto'` |
| `legend_bool` | bool | Include legend in plot. | `True` |
| `legend_pos` | string | Position of legend (`'outside'` or `'inside'`). | `'outside'` |
| `save` | bool | If True, save figure as file. | `False` |
| `image_format` | string | Format of image file to be saved (`'png'` or `'svg'`). | `'png'` |
| `dpi` | int | Resolution of saved image file. | `300` |
| `image_save_path` | string | Path to saved image file. | `None` |
Source code in linmo/plot.py
def multi_dataset_plot_deviation(subtree, 
                                 dataset_names,
                                 df_true_melt_dataset_label_c_c, 
                                 dataset_color_dict,
                                 cell_color_dict,
                                 cutoff='auto',
                                 title='auto',
                                 legend_bool=True,
                                 legend_pos='outside',
                                 save=False, 
                                 image_format='png',
                                 dpi=300,
                                 image_save_path=None):

    """Plots deviation of `cutoff` number of subtrees in multiple datasets.

    Args:
        subtree (string): Type of subtree.
        dataset_names (list): List where each entry is a string representing the dataset label. 
        df_true_melt_dataset_label_c_c (DataFrame): DataFrame with cutoff number of most significant subtrees for plotting.
            Sorted by z-score from most over-represented to most under-represented.
            Output from `multi_dataset_dfs_for_plotting` function.
        dataset_color_dict (dict): Keys are dataset names, values are colors.
        cell_color_dict (dict): Keys are cell fates, values are colors.
        cutoff (string or NoneType or int, optional): Take `cutoff` number of subtrees with largest absolute z-scores 
            to include in plots.
            If not provided explicitly, will be automatically determined to take all subtrees with abs z-score > 1.
            If NoneType, take all subtrees.
        title (string, optional): Title to use for plot. If not provided explicitly, will be automatically determined to read `subtree` frequency.
        legend_bool (bool, optional): Include legend in plot.
        legend_pos (string, optional): Position of legend (outside or inside).
        save (bool, optional): If True, save figure as file.
        image_format (string, optional): Format of image file to be saved (png or svg).
        dpi (int, optional): Resolution of saved image file.
        image_save_path (string, optional): Path to saved image file.
    """

    margins=0.05
    bbox_to_anchor=(0, 0)  
    figsize=(0.23*len(df_true_melt_dataset_label_c_c)/len(dataset_names)+margins, 2.5)

    sns.set_style('whitegrid')
    fig, ax = pyplot.subplots(figsize=figsize)
    pyplot.setp(ax.collections)

    pyplot.axhline(y=0, color='gray', linestyle='-', label='No deviation', zorder=1)

    for i, dataset in enumerate(dataset_names):
        i+=1
        pyplot.scatter(x="label", y="z-score", data=df_true_melt_dataset_label_c_c.loc[df_true_melt_dataset_label_c_c['dataset']==dataset], color=dataset_color_dict[dataset], label=f'{dataset}', s=10, zorder=i*5)
        pyplot.plot(df_true_melt_dataset_label_c_c.loc[df_true_melt_dataset_label_c_c['dataset']==dataset]['label'], df_true_melt_dataset_label_c_c.loc[df_true_melt_dataset_label_c_c['dataset']==dataset]['z-score'], color=dataset_color_dict[dataset], linewidth=0.75, zorder=1)

    pyplot.margins(x=0.05, y=0.15)
    pyplot.grid(True)
    ax.set_xticklabels([])

    if title == 'auto':
        pyplot.title('Deviation from resamples', y=1.02, **{'fontname':'Arial', 'size':8})#, fontweight='bold')
    else:
        pyplot.title(f'{title}', y=1.02, **{'fontname':'Arial', 'size':8})#, fontweight='bold')
    pyplot.ylabel('z-score', **{'fontname':'Arial', 'size':8})
    pyplot.yticks(**{'fontname':'Arial', 'size':8})

    if legend_bool == True:
        legend_props = font_manager.FontProperties(family='Arial', style='normal', size=6)
        if legend_pos == 'outside':
            pyplot.legend(loc='upper left', framealpha=1, prop=legend_props, bbox_to_anchor=(1.05,1.0))
        elif legend_pos == 'inside':
            pyplot.legend(loc='upper right', framealpha=1, prop=legend_props)
    for i, artist in enumerate(ax.findobj(PathCollection)):
        artist.set_zorder(1)

    for subtree_label in df_true_melt_dataset_label_c_c.loc[df_true_melt_dataset_label_c_c['dataset']==dataset_names[0]]['label'].values:
        _make_annotation(cell_color_dict, ax, subtree_label, subtree)

    labelpad = df_annotations.loc[df_annotations['subtree_type']==subtree]['labelpad'].values[0]    

    if cutoff is None:
        pyplot.xlabel(f'All {subtree} combinations', labelpad=labelpad, **{'fontname':'Arial', 'size':8})
    else:
        pyplot.xlabel(f'{subtree.capitalize()} combinations \n(top {int(len(df_true_melt_dataset_label_c_c)/len(dataset_names))} by abs z-score)', labelpad=labelpad, **{'fontname':'Arial', 'size':8})

    if save==True:
        pyplot.savefig(f"{image_save_path}.{image_format}", dpi=dpi, bbox_inches="tight")
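The p-value columns consumed by these plotting functions are adjusted with `multipletests(..., method='fdr_bh')`. As a reference for what that correction does, here is a minimal numpy sketch of the plain Benjamini-Hochberg step-up adjustment (illustrative only; it omits the two-stage `'fdr_tsbh'` variant and is not the statsmodels implementation):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (plain step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)                        # ascending p-values
    ranked = p[order] * n / np.arange(1, n + 1)  # p_(i) * n / i
    # enforce monotonicity from the largest rank downwards
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adj, 0, 1)              # restore original order, cap at 1
    return out

adj = bh_adjust([0.01, 0.04, 0.03, 0.20])
```

Each raw p-value is scaled by `n / rank` and then made monotone, so an adjusted p-value below 0.05 marks a subtree that survives the false discovery rate control used in the significance annotations.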

plot_deviation(subtree, df_true_melt_subset, df_null_zscores_i_c_melt_subset, df_null_zscores_i_c_melt_100resamples_subset, cell_color_dict, fdr_type='fdr_tsbh', cutoff='auto', title='auto', multiple_datasets=False, legend_bool=True, legend_pos='outside', save=False, image_format='png', dpi=300, image_save_path=None)

Plots deviation of `cutoff` number of subtrees in the original dataset and `num_null` resamples.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `subtree` | string | Type of subtree. | required |
| `df_true_melt_subset` | DataFrame | DataFrame with `cutoff` number of most significant subtrees for plotting. Sorted by z-score from most over-represented to most under-represented. Output from `dfs_for_plotting` function. | required |
| `df_null_zscores_i_c_melt_subset` | DataFrame | Melted DataFrame with null z-score for `cutoff` number of most significant subtrees across `num_null` random resamples. Output from `dfs_for_plotting` function. | required |
| `df_null_zscores_i_c_melt_100resamples_subset` | DataFrame | Melted DataFrame with null z-score for `cutoff` number of most significant subtrees across 100 random resamples. Output from `dfs_for_plotting` function. | required |
| `cell_color_dict` | dict | Keys are cell fates, values are colors. | required |
| `fdr_type` | string | Use the Benjamini-Hochberg FDR correction if `'fdr_bh'`, or the Benjamini-Hochberg correction with the two-stage linear step-up procedure if `'fdr_tsbh'`. | `'fdr_tsbh'` |
| `cutoff` | string, NoneType, or int | Take `cutoff` number of subtrees with largest absolute z-scores to include in plots. If not provided explicitly, automatically determined to take all subtrees with abs z-score > 1. If None, take all subtrees. | `'auto'` |
| `title` | string | Title to use for plot. If not provided explicitly, automatically determined to read `subtree` frequency. | `'auto'` |
| `multiple_datasets` | bool | Modify the x-axis label depending on whether a single dataset or multiple datasets were used. | `False` |
| `legend_bool` | bool | Include legend in plot. | `True` |
| `legend_pos` | string | Position of legend (`'outside'` or `'inside'`). | `'outside'` |
| `save` | bool | If True, save figure as file. | `False` |
| `image_format` | string | Format of image file to be saved (`'png'` or `'svg'`). | `'png'` |
| `dpi` | int | Resolution of saved image file. | `300` |
| `image_save_path` | string | Path to saved image file. | `None` |
Source code in linmo/plot.py
def plot_deviation(subtree, 
                   df_true_melt_subset, 
                   df_null_zscores_i_c_melt_subset, 
                   df_null_zscores_i_c_melt_100resamples_subset, 
                   cell_color_dict,
                   fdr_type='fdr_tsbh',
                   cutoff='auto', 
                   title='auto',
                   multiple_datasets=False,
                   legend_bool=True,
                   legend_pos='outside',
                   save=False, 
                   image_format='png',
                   dpi=300,
                   image_save_path=None):

    """Plots deviation of `cutoff` number of subtrees in original dataset and `num_null` resamples.

    Args:
        subtree (string): Type of subtree.
        df_true_melt_subset (DataFrame): DataFrame with cutoff number of most significant subtrees for plotting.
            Sorted by z-score from most over-represented to most under-represented.
            Output from `dfs_for_plotting` function.
        df_null_zscores_i_c_melt_subset (DataFrame): Melted DataFrame with null z-score for `cutoff` number of most significant
            subtrees across `num_null` random resamples.
            Output from `dfs_for_plotting` function.
        df_null_zscores_i_c_melt_100resamples_subset (DataFrame): Melted DataFrame with null z-score for `cutoff` number of 
            most significant subtrees across 100 random resamples.
            Output from `dfs_for_plotting` function.
        cell_color_dict (dict): Keys are cell fates, values are colors.
        fdr_type (string, optional): Use the Benjamini-Hochberg FDR correction if 'fdr_bh', or the Benjamini-Hochberg
            correction with the two-stage linear step-up procedure if 'fdr_tsbh'. Defaults to 'fdr_tsbh'.
        cutoff (string or NoneType or int, optional): Take `cutoff` number of subtrees with largest absolute z-scores 
            to include in plots.
            If not provided explicitly, will be automatically determined to take all subtrees with abs z-score > 1.
            If NoneType, take all subtrees.
        title (string, optional): Title to use for plot. If not provided explicitly, will be automatically determined to read `subtree` frequency.
        multiple_datasets (bool, optional): Modify the x-axis label depending on whether a single dataset or multiple datasets were used.
        legend_bool (bool, optional): Include legend in plot.
        legend_pos (string, optional): Position of legend (outside or inside).
        save (bool, optional): If True, save figure as file.
        image_format (string, optional): Format of image file to be saved (png or svg).
        dpi (int, optional): Resolution of saved image file.
        image_save_path (string, optional): Path to saved image file.
    """

    df_true_melt_subset_sg = df_true_melt_subset.loc[df_true_melt_subset[f'adj_p_val_{fdr_type}']<0.05].copy()

    margins=0.05
    bbox_to_anchor=(0, 0)  
    figsize=(0.23*len(df_true_melt_subset)+margins, 2.5)

    sns.set_style('whitegrid')
    fig, ax = pyplot.subplots(figsize=figsize)
    pyplot.setp(ax.collections)

    sns.violinplot(x='label', 
                   y='observed', 
                   data=df_null_zscores_i_c_melt_subset, 
                   cut=0,
                   inner=None,
                   color='#BCBEC0',
                   scale='width',
                   linewidth=0,
                   )
    sns.stripplot(x='label', 
                  y='observed', 
                  data=df_null_zscores_i_c_melt_100resamples_subset, 
                  jitter=0.2,
                  color='gray',
                  size=0.5,
                 )
    pyplot.scatter(x="label", y="z-score", data=df_true_melt_subset, color='red', label='Observed count', s=2.5)
    pyplot.scatter(x="label", y="null z-score mean", data=df_true_melt_subset, color='gray', label='Null z-score across resamples', s=2.5)
    pyplot.scatter(x="label", y="null z-score mean", data=df_true_melt_subset, color='black', label='Average null z-score', s=2.5)
    pyplot.scatter(x="label", y="null z-score min", data=df_true_melt_subset, color='gray', s=0, label='')
    pyplot.scatter(x="label", y="null z-score max", data=df_true_melt_subset, color='gray', s=0, label='')
    pyplot.scatter(x="label", y="z-score", data=df_true_melt_subset, color='red', label='', s=2.5)
    #pyplot.scatter(x="label", y="z-score", data=df_true_melt_subset_sg, color='red', s=25, alpha=0.35, label='Adjusted p-value < 0.05')

    # add annotations for adjusted p-value
    for label in df_true_melt_subset_sg['label'].values:
        adj_p_val = df_true_melt_subset_sg.loc[df_true_melt_subset_sg['label']==label][f'adj_p_val_{fdr_type}'].values[0]
        val = df_true_melt_subset_sg.loc[df_true_melt_subset_sg['label']==label]['z-score'].values[0]
        null = df_true_melt_subset_sg.loc[df_true_melt_subset_sg['label']==label]['null z-score mean'].values[0]
        if val > null:
            y_coord = val+max(df_true_melt_subset['z-score'])/10
            pyplot.annotate(_annot(adj_p_val), xy=(label, y_coord), ha='center', va='bottom', **{'fontname':'Arial', 'size':8})
        else:
            y_coord = val-max(df_true_melt_subset['z-score'])/10
            pyplot.annotate(_annot(adj_p_val), xy=(label, y_coord), ha='center', va='top', **{'fontname':'Arial', 'size':8})


    pyplot.margins(x=0.05, y=0.15)
    pyplot.grid(True)
    ax.set_xticklabels([])

    if title == 'auto':
        pyplot.title('Deviation from resamples', y=1.02, **{'fontname':'Arial', 'size':8})#, fontweight='bold')
    else:
        pyplot.title(f'{title}', y=1.02, **{'fontname':'Arial', 'size':8})#, fontweight='bold')
    pyplot.ylabel('z-score', **{'fontname':'Arial', 'size':8})
    pyplot.yticks(**{'fontname':'Arial', 'size':8})

    if legend_bool == True:
        legend_props = font_manager.FontProperties(family='Arial', style='normal', size=6)
        if legend_pos == 'outside':
            pyplot.legend(loc='upper left', framealpha=1, prop=legend_props, bbox_to_anchor=(1.05,1.0))
        elif legend_pos == 'inside':
            pyplot.legend(loc='upper right', framealpha=1, prop=legend_props)
    for i, artist in enumerate(ax.findobj(PathCollection)):
        artist.set_zorder(1)

    for subtree_label in df_true_melt_subset['label'].values:
        _make_annotation(cell_color_dict, ax, subtree_label, subtree)

    labelpad = df_annotations.loc[df_annotations['subtree_type']==subtree]['labelpad'].values[0]    

    if cutoff is None:
        pyplot.xlabel(f'All {subtree} combinations', labelpad=labelpad, **{'fontname':'Arial', 'size':8})
    else:
        if multiple_datasets == False:
            pyplot.xlabel(f'{subtree.capitalize()} combinations \n(top {len(df_true_melt_subset)} by abs z-score)', labelpad=labelpad, **{'fontname':'Arial', 'size':8})
        else:
            pyplot.xlabel(f'{subtree.capitalize()} combinations \n(top {len(df_true_melt_subset)} by abs z-score across all datasets)', labelpad=labelpad, **{'fontname':'Arial', 'size':8})

    if save==True:
        pyplot.savefig(f"{image_save_path}.{image_format}", dpi=dpi, bbox_inches="tight")
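Both plotting functions consume "melted" long-form DataFrames with one `(subtree_val, observed)` pair per subtree per resample. A minimal sketch of that reshaping, matching the `pd.melt(... .transpose(), ...)` pattern used upstream (toy data; column names `'0'`–`'2'` stand in for resample indices):

```python
import pandas as pd

# toy wide table: rows are subtree values, columns '0'..'2' are resamples
df_wide = pd.DataFrame({'0': [3, 1], '1': [4, 0], '2': [2, 2]})

# transpose so each row is one resample, then melt to long form:
# one (subtree_val, observed) row per subtree per resample
df_long = pd.melt(df_wide.transpose(), var_name='subtree_val', value_name='observed')
```

In this long form, `df_long.loc[df_long['subtree_val'] == i]['observed']` yields the full null distribution of counts for subtree `i`, which is what seaborn's `violinplot` and `stripplot` calls above plot per label.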

plot_frequency(subtree, df_true_melt_subset, df_melt_subset, df_melt_100resamples_subset, cell_color_dict, use_expected=True, fdr_type='fdr_tsbh', cutoff='auto', title='auto', multiple_datasets=False, legend_bool=True, legend_pos='outside', save=False, image_format='png', dpi=300, image_save_path=None)

Plots frequency of `cutoff` number of subtrees in the original dataset and all resamples.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `subtree` | string | Type of subtree. | required |
| `df_true_melt_subset` | DataFrame | DataFrame with `cutoff` number of most significant subtrees for plotting. Sorted by z-score from most over-represented to most under-represented. Output from `dfs_for_plotting` function. | required |
| `df_melt_subset` | DataFrame | Melted DataFrame with observed count for `cutoff` number of most significant subtrees across all resamples. Output from `dfs_for_plotting` function. | required |
| `df_melt_100resamples_subset` | DataFrame | Melted DataFrame with observed count for `cutoff` number of most significant subtrees across 100 random resamples. Output from `dfs_for_plotting` function. | required |
| `cell_color_dict` | dict | Keys are cell fates, values are colors. | required |
| `use_expected` | bool | If True, include the analytically expected count in the plot. | `True` |
| `fdr_type` | string | Use the Benjamini-Hochberg FDR correction if `'fdr_bh'`, or the Benjamini-Hochberg correction with the two-stage linear step-up procedure if `'fdr_tsbh'`. | `'fdr_tsbh'` |
| `cutoff` | string, NoneType, or int | Take `cutoff` number of subtrees with largest absolute z-scores to include in plots. If not provided explicitly, automatically determined to take all subtrees with abs z-score > 1. If None, take all subtrees. | `'auto'` |
| `title` | string | Title to use for plot. If not provided explicitly, automatically determined to read `subtree` frequency. | `'auto'` |
| `multiple_datasets` | bool | Modify the x-axis label depending on whether a single dataset or multiple datasets were used. | `False` |
| `legend_bool` | bool | Include legend in plot. | `True` |
| `legend_pos` | string | Position of legend (`'outside'` or `'inside'`). | `'outside'` |
| `save` | bool | If True, save figure as file. | `False` |
| `image_format` | string | Format of image file to be saved (`'png'` or `'svg'`). | `'png'` |
| `dpi` | int | Resolution of saved image file. | `300` |
| `image_save_path` | string | Path to saved image file. | `None` |
Source code in linmo/plot.py
def plot_frequency(subtree, 
                   df_true_melt_subset, 
                   df_melt_subset, 
                   df_melt_100resamples_subset, 
                   cell_color_dict,
                   use_expected=True,
                   fdr_type='fdr_tsbh',
                   cutoff='auto', 
                   title='auto',
                   multiple_datasets=False,
                   legend_bool=True, 
                   legend_pos='outside',
                   save=False, 
                   image_format='png',
                   dpi=300,
                   image_save_path=None):

    """Plots frequency of `cutoff` number of subtrees in original dataset and all resamples.

    Args:
        subtree (string): Type of subtree.
        df_true_melt_subset (DataFrame): DataFrame with `cutoff` number of most significant subtrees for plotting.
            Sorted by z-score from most over-represented to most under-represented.
            Output from `dfs_for_plotting` function.
        df_melt_subset (DataFrame): Melted DataFrame with observed count for `cutoff` number of most significant subtrees 
            across all resamples.
            Output from `dfs_for_plotting` function.
        df_melt_100resamples_subset (DataFrame): Melted DataFrame with observed count for `cutoff` number of most significant
            subtrees across 100 random resamples.
            Output from `dfs_for_plotting` function.
        cell_color_dict (dict): Keys are cell fates, values are colors.
        use_expected (bool, optional): If True, include the analytically expected count in the plot.
        fdr_type (string, optional): Use the Benjamini-Hochberg FDR correction if 'fdr_bh', or the Benjamini-Hochberg
            correction with the two-stage linear step-up procedure if 'fdr_tsbh'. Defaults to 'fdr_tsbh'.
        cutoff (string or NoneType or int, optional): Take `cutoff` number of subtrees with largest absolute z-scores 
            to include in plots.
            If not provided explicitly, will be automatically determined to take all subtrees with abs z-score > 1.
            If NoneType, take all subtrees.
        title (string, optional): Title to use for plot. If not provided explicitly, will be automatically determined to read `subtree` frequency.
        multiple_datasets (bool, optional): Modify the x-axis label depending on whether a single dataset or multiple datasets were used.
        legend_bool (bool, optional): Include legend in plot.
        legend_pos (string, optional): Position of legend (outside or inside).
        save (bool, optional): If True, save figure as file.
        image_format (string, optional): Format of image file to be saved (png or svg).
        dpi (int, optional): Resolution of saved image file.
        image_save_path (string, optional): Path to saved image file.
    """

    df_true_melt_subset_sg = df_true_melt_subset.loc[df_true_melt_subset[f'adj_p_val_{fdr_type}']<0.05].copy()

    margins=0.05
    bbox_to_anchor=(0, 0)  
    figsize=(0.23*len(df_true_melt_subset)+margins, 2.5)

    sns.set_style('whitegrid')
    fig, ax = pyplot.subplots(figsize=figsize)
    pyplot.setp(ax.collections)

    sns.violinplot(x='label', 
                   y='observed', 
                   data=df_melt_subset, 
                   cut=0,
                   inner=None,
                   color='#BCBEC0',
                   scale='width',
                   linewidth=0,
                   )
    sns.stripplot(x='label', 
                  y='observed', 
                  data=df_melt_100resamples_subset, 
                  jitter=0.2,
                  color='gray',
                  size=0.5,
                 )
    pyplot.scatter(x='label', y='observed', data=df_true_melt_subset, color='red', label='Observed count', s=2.5)
    pyplot.scatter(x='label', y='null mean', data=df_true_melt_subset, color='gray', label='Count across all resamples', s=2.5)
    if use_expected == True:
        pyplot.scatter(x='label', y='expected', data=df_true_melt_subset, color='black', label='Expected count', s=2.5)
    pyplot.scatter(x='label', y='null min', data=df_true_melt_subset, color='gray', s=0, label='')
    pyplot.scatter(x='label', y='null max', data=df_true_melt_subset, color='gray', s=0, label='')
    pyplot.scatter(x='label', y='observed', data=df_true_melt_subset, color='red', label='', s=2.5)
    #pyplot.scatter(x='label', y='observed', data=df_true_melt_subset_sg, color='red', s=25, alpha=0.35, label='Adjusted p-value < 0.05')

    # add annotations for adjusted p-value
    for label in df_true_melt_subset_sg['label'].values:
        adj_p_val = df_true_melt_subset_sg.loc[df_true_melt_subset_sg['label']==label][f'adj_p_val_{fdr_type}'].values[0]
        val = df_true_melt_subset_sg.loc[df_true_melt_subset_sg['label']==label]['observed'].values[0]
        null = df_true_melt_subset_sg.loc[df_true_melt_subset_sg['label']==label]['null mean'].values[0]
        if val > null:
            y_coord = val+max(df_true_melt_subset['observed'])/10
            pyplot.annotate(_annot(adj_p_val), xy=(label, y_coord), ha='center', va='bottom', **{'fontname':'Arial', 'size':8})
        else:
            y_coord = val-max(df_true_melt_subset['observed'])/10
            pyplot.annotate(_annot(adj_p_val), xy=(label, y_coord), ha='center', va='top', **{'fontname':'Arial', 'size':8})

    pyplot.margins(x=0.05, y=0.15)
    pyplot.grid(True)
    ax.set_xticklabels([])

    if title == 'auto':
        pyplot.title(f'{subtree.capitalize()} frequency', y=1.02, **{'fontname':'Arial', 'size':8})#, fontweight='bold')
    else:
        pyplot.title(f'{title}', y=1.02, **{'fontname':'Arial', 'size':8})#, fontweight='bold')
    pyplot.ylabel('Counts', **{'fontname':'Arial', 'size':8})
    pyplot.yticks(**{'fontname':'Arial', 'size':8})

    if legend_bool == True:
        legend_props = font_manager.FontProperties(family='Arial', style='normal', size=6)
        if legend_pos == 'outside':
            pyplot.legend(loc='upper left', framealpha=1, prop=legend_props, bbox_to_anchor=(1.05,1.0))
        elif legend_pos == 'inside':
            pyplot.legend(loc='upper right', framealpha=1, prop=legend_props)

    for artist in ax.findobj(PathCollection):
        artist.set_zorder(1)

    for subtree_label in df_true_melt_subset['label'].values:
        _make_annotation(cell_color_dict, ax, subtree_label, subtree)

    labelpad = df_annotations.loc[df_annotations['subtree_type']==subtree]['labelpad'].values[0]    

    if cutoff is None:
        pyplot.xlabel(f'All {subtree} combinations', labelpad=labelpad, **{'fontname':'Arial', 'size':8})
    else:
        if multiple_datasets == False:
            pyplot.xlabel(f'{subtree.capitalize()} combinations \n(top {len(df_true_melt_subset)} by abs z-score)', labelpad=labelpad, **{'fontname':'Arial', 'size':8})
        else:
            pyplot.xlabel(f'{subtree.capitalize()} combinations \n(top {len(df_true_melt_subset)} by abs z-score across all datasets)', labelpad=labelpad, **{'fontname':'Arial', 'size':8})

    if save==True:
        pyplot.savefig(f"{image_save_path}.{image_format}", dpi=dpi, bbox_inches="tight")

linmo.simulate

Provides functions for simulating lineage trees.

This module contains the following functions:

- `simulate_tree` - Simulates a tree based on a provided transition matrix and progenitor/cell type labels.

simulate_tree(transition_matrix, starting_progenitor, labels)

Simulate tree based on provided transition matrix and progenitor/cell type labels. Progenitors are represented by lowercase letters, cell types are represented by uppercase letters.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `transition_matrix` | array | Matrix where rows represent the original state and columns represent the state to transition into. Rows should sum to 1 for progenitors. | required |
| `starting_progenitor` | string | String with the starting progenitor. | required |
| `labels` | string | String with progenitor/cell type labels that correspond to the rows of the provided transition matrix. | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `tree_input` | string | New tree in NEWICK format after simulated division until no progenitor cells remain. |

Source code in linmo/simulate.py
def simulate_tree(transition_matrix, starting_progenitor, labels):
    '''Simulate tree based on provided transition matrix and progenitor/cell type labels.
    Progenitors are represented by lowercase letters, cell types are represented by uppercase letters.

    Args:
        transition_matrix (array): Matrix where rows represent the original state and columns represent the state to transition into.
            Rows should sum to 1 for progenitors.
        starting_progenitor (string): String with the starting progenitor.
        labels (string): String with progenitor/cell type labels that correspond to the rows of the provided transition matrix.

    Returns:
        tree_input (string): New tree in NEWICK format after simulated division until no progenitor cells remain.
    '''
    tree_input = starting_progenitor
    # keep dividing while the tree still contains progenitors (lowercase letters)
    while re.search('[a-z]', tree_input):
        tree_input = _divide(tree_input, transition_matrix, labels)
    return tree_input
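The internal `_divide` helper is not shown above, so the following is a standalone sketch of the semantics, not linmo's implementation: it assumes each progenitor division draws two daughter states independently from the parent's row of the transition matrix. The names `simulate_tree_sketch`, `T`, and the label string are illustrative only:

```python
import re
import numpy as np

def simulate_tree_sketch(transition_matrix, starting_progenitor, labels, rng=None):
    """Standalone sketch of simulate_tree's semantics (assumption: each
    division samples two daughters independently from the parent's row)."""
    rng = np.random.default_rng(rng)
    tree = starting_progenitor
    # keep dividing while any lowercase (progenitor) labels remain
    while re.search('[a-z]', tree):
        def divide(match):
            parent = match.group(0)
            row = transition_matrix[labels.index(parent)]
            d1, d2 = rng.choice(list(labels), size=2, p=row)
            return f'({d1},{d2})'
        tree = re.sub('[a-z]', divide, tree)
    return tree

# Progenitor 'p' always divides into the terminal fates A or B
labels = 'pAB'
T = np.array([[0.0, 0.5, 0.5],   # p -> A or B with equal probability
              [0.0, 0.0, 0.0],   # A is terminal (row unused)
              [0.0, 0.0, 0.0]])  # B is terminal (row unused)
result = simulate_tree_sketch(T, 'p', labels, rng=0)
print(result)  # a NEWICK-style string such as '(A,B)'
```

With this matrix the starting progenitor divides exactly once, so the result is always a pair of uppercase fates in NEWICK parentheses.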