Motif Kernel

The Motif Kernel depends on a user defined set of motifs. For each input sequence the number of contained motifs (motif content) is computed. These values can then be used as a similarity measurement. Since the features of this kernel are restricted to a user defined set, the performance in later analysis of the kernel output (e.g. classification of cell populations) depends highly on the picked motifs. While this offers a lot of flexibility, the user should carefully decide which motifs are used. The package does not yet include any method to extract significant motifs from a set of sequences. There is, however, a large variety of motif extraction applications available [1] [2] [3]. Information on how exactly this kernel implementation works can be found in Modules.

A single motif is defined by a sequence of the following elements:

  • A single character from an alphabet e.g. [A (Adenine), G (Guanine), T (Thymine), C (Cytosine)]. This character only matches the exact same character.
  • The wildcard character “.” which can represent and therefore matches any of the characters from the alphabet.
  • A substitution group, which is a list of characters from the alphabet enclosed in square brackets e.g. [AG]. This group matches every character within the brackets (e.g. A or G). If the leading character is a “^” the substitution group matches any character but those in the brackets (e.g. C or T).

How to use the Motif Kernel

The collection of motifs has to be defined as a list of strings:

motif_collection = ["A[^CG]T", "C.G", "C..G.T", "G[A][AT]", "GT.A[CA].[CT]G"]

This collection can then be used to create the motif Kernel:

from strkernel.motifkernel import motifKernel
motif_kernel = motifKernel(motif_collection)

Afterwards a set of sequences can be passed to the motifKernel object which returns a sparse matrix. The sparse matrix contains the motif content for each sequence (default) or the similarities between the sequences based on the motif content (with return_kernel_matrix enabled). If the flanking characters (lower case letters) should not be included include_flanking has to be disabled. Flanking characters are enabled by default since it generally improves accuracy if the kernel output is used for classification. For large sets of sequences disabling include_flanking characters can give a significant performance boost:

sequences = ["ACGTCGATGC", "GTCGATAGC", "GCTAGCacgtaCGC","GTAGCTgtgcGTGcgt", "CGATAGCTAGTTAGC"]
#compute motif content matrix
motif_content = motif_kernel.compute_matrix(sequences)
#compute kernel matrix
kernel_matrix = motif_kernel.compute_matrix(sequences, return_kernel_matrix = True)
#compute motif content matrix without considering the flanking characters
matirx_without_flanking = motif_kernel.compute_matrix(sequences, include_flanking = False)

References

[1]Pissis, S.P., Stamatakis, A. and Pavlidis, P., 2013, September. MoTeX: A word-based HPC tool for MoTif eXtraction. In Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics (p. 13). ACM.
[2]Pissis, S.P., 2014. MoTeX-II: structured MoTif eXtraction from large-scale datasets. BMC bioinformatics, 15(1), p.235.
[3]Zhang, Y. and Zaki, M.J., 2006. EXMOTIF: efficient structured motif extraction. Algorithms for Molecular Biology, 1(1), p.21.

Modules

Motif Kernel Module.

class strkernel.motifkernel.motifKernel(motifs: [<class 'str'>])[source]

Bases: object

This is the main class used for the motif kernel construction. The idea is to construct a Trie from a set of Motifs and then use this Trie to compute the motif content of a sequnece. The motif content can then be used to compute the similarity between sequences and therefore is able to serve as input for machine learning algorithms.

Example: An elaborate example can be found in the Tutorials sections.

Standard Use:

motifKernel(motifs)
compute_matrix(sequences: [<class 'str'>], include_flanking: bool = True, return_kernel_matrix: bool = False)[source]

Computes the motif content of a set of sequences and returns a sparse matrix which can be used as input for machine learning approaches. The sparse matrix has only been tested with algorithms from the python package sklearn. Optional parameters allow the computation of a kernel matrix and inclusion of flanking regions.

Args:

sequences: A list of strings that are of the same alphabet as the motifs used to construct the motif Trie.

include_flanking: Option to include or disregard the flanking regions. Default is True.

return_kernel_matrix: A boolean value that indicates if the function should return a sparse matrix with the similarities between sequences (True) or a sparse matrix where each row contains the motif content of a sequence (False). Default is False.

Returns:
csr_matrix: A sparse matrix object containg either the kernel matrix (return_kernel_matrix = True) or the motif content of each sequence.

Motif Trie

Motif Trie Module

class strkernel.lib.motiftrie.MotifTrie(motifs: [<class 'str'>])[source]

Bases: object

The main objective of this class is to construct a Trie from a set of Motifs (strings). The Trie can then be used to compute the motif content of a sequence.

First, Each of the motifs which is passed to the constructer is converted into a Motif object. The Motif objects are then used for the Trie constrcution.

The Trie construction can be seperated into the following steps:

  1. The Trie is initialized with a root node (TrieNode object)
  2. Each Motif object is added to the Trie by parsing the Trie and adding the parts of the Motif which are not yet present.
  3. The final Node object of each Motif is marked.

If sequences are passed to check_for_motifs function a DFS is performend and a numpy array containing the motif content is returned.

add(motif: strkernel.lib.motif.Motif)[source]

Adds a motif to the Trie.

Args:
motif: A motif object that originates from a motif (string) passed to the constructer of the MotifTrie.
check_for_motifs(sequence: str) → numpy.array[source]

Iterates over the given sequence and returns the sum of the motif content of all subsequences.

Args:
sequence: A sequence (read) that only has characters that also appear in the alphabet of the motifs used to construct the MotifTrie.
Returns:
Numpy array containing the motif content of the sequence.
dfs(sequence: str, motifdict: dict) → [<class 'str'>][source]

Performs a depth first search on the input sequence and adds the motif content to the input dictionary.

Args:
sequence: A part of the sequence given to check_for_motifs. In the first iteration of check_for_motifs the complete sequence is passed to this function. motifdict: A dictionary where each entry refers to one of the motifs used to construct the MotifTrie.
Returns:
The motifdict with the motif content in this specific sequence.
class strkernel.lib.motiftrie.TrieNode(char: str)[source]

Bases: object

The TrieNode class consist of an element of a motif and its children in the Motif Trie. This class can be considerd a helper for the MotifTrie class.

Motif

Motif Module.

class strkernel.lib.motif.Motif(motif: str)[source]

Bases: object

The input string gets transformed into an object of the Motif class. The main attribute of this class is the motif itself which is given as a list of strings where each string is one of the following elements:

  1. A character of a common alphabet (e.g. {A,G,C,T}) -> this alphabet should be the same for all motifs
  2. The wildcard character “.”
  3. Substiution groups that contain 2 ore more characters of the alphabet (e.g. [AG] or [CT]). “^” as a leading character indicates that every character in the alphabet but those in the substitution group matches this part of the motif.
get_alphabet()[source]

Extracts the alphabet (unique characters) from a string.

process_motif()[source]

Processes the input motif which is a string and returns a list of strings.