evalne.utils package

Submodules

evalne.utils.preprocess module

evalne.utils.preprocess.get_redges_false(G, output_path=None)[source]

For directed graphs returns a list of all non-edges for which the opposite edge exists in the graph. E.g. returns all pairs of non-edges (a -> b) such that (b -> a) exists in the graph.

Parameters:
  • G (graph) – A NetworkX digraph.
  • output_path (string) – A path or file where to store the results.
Returns:

redges_false – A set of node pairs respecting the mentioned property.

Return type:

set

evalne.utils.preprocess.get_stats(G, output_path=None, all_stats=False)[source]

Prints or stores some basic statistics about the graph. If an output path is provided the results are written in said file.

Parameters:
  • G (graph) – A NetworkX graph or digraph.
  • output_path (file or string, optional) – File or filename to write. Default is None
  • all_stats (bool, optional) – Sets if all stats or a small subset of them should be shown. Computing all stats can be very slow. Default is False.
evalne.utils.preprocess.infer_header(input_path, expected_lines, method_name=None)[source]

Method that infers the length of the header of a given file from the number of lines and of expected lines.

Parameters:
  • input_path (file or string) – File or filename to read.
  • expected_lines (int) – Number of expected lines in the input file.
  • method_name (string, optional) – A string indicating the name of the method being evaluated. If provided will be used when logging an error. Default is None.
Returns:

header_len – The length of the header.

Return type:

int

Raises:

ValueError – If more lines are expected than those in the file.

evalne.utils.preprocess.load_graph(input_path, delimiter=', ', comments='#', directed=False, datatype=<class 'float'>)[source]

Reads a directed or undirected edgelist and returns a graph. If the edgelist is weighted the graph will maintain those weights. Nodes are read as int and weights as float.

Parameters:
  • input_path (file or string) – File or filename to read.
  • delimiter (string, optional) – The string used to separate values in the input file. Default is ‘,’.
  • comments (string, optional) – The character used to indicate the start of a comment. Default is ‘#’.
  • directed (bool) – Indicates if the method should return a graph or a digraph.
  • datatype (int or float, optional) – The type of the graph weights. Default is float.
Returns:

G – A NetworkX graph or digraph.

Return type:

graph

evalne.utils.preprocess.prep_graph(G, relabel=True, del_self_loops=True, maincc=True)[source]

Preprocess a graph according to the input parameters.

Parameters:
  • G (graph) – A NetworkX graph or digraph.
  • relabel (bool, optional) – Determines if the nodes should be relabeled with consecutive integers 0…N. Default is True.
  • del_self_loops (bool, optional) – Determines if self loops should be deleted. Default is True.
  • maincc (bool, optional) – Determines if the graphs (digraphs) should be restricted to the main (weakly) connected component or not. Default is True.
Returns:

  • G (graph) – A preprocessed NetworkX graph or digraph.
  • Ids (list of tuples) – A list of (OldNodeID, NewNodeID). Returns None if relabel=False.

Notes

For some methods trying to embed graphs with several connected components might result in inconsistent behaviours. This is the main reason for setting maincc=True.

evalne.utils.preprocess.prune_nodes(G, threshold)[source]

Removes all nodes from the graph whose degree is strictly smaller that the threshold. This can result in a disconnected graph.

Parameters:
  • G (graph) – A NetworkX graph or digraph.
  • threshold (int) – All nodes with degree lower than this value will be removed from the graph.
Returns:

H – A copy of the original graph or digraph with removed nodes.

Return type:

graph

evalne.utils.preprocess.read_edge_embeddings(input_path, ebunch_len, embed_dim, delimiter=', ', method_name=None)[source]

Reads node-pair or edge embeddings from a file and returns the results as a matrix. Each line in the file is assumed to represent the embedding of one node-pair. File headers are inferred base on the expected number of embeddings and ignored.

Parameters:
  • input_path (file or string) – File or filename to read.
  • ebunch_len (int) – The number of embeddings expected.
  • embed_dim (int) – The expected dimensionality of the embeddings.
  • delimiter (string, optional) – The string used to separate values in the input file. Default is ‘,’.
  • method_name (string, optional) – A string indicating the name of the method being evaluated. If provided will be used when logging an error. Default is None.
Returns:

Y – A column vector containing embeddings as rows.

Return type:

ndarray

Raises:

ValueError – If the number of embeddings in the file is less than ebunch_len. If the input data dimensions are not correct.

evalne.utils.preprocess.read_labels(input_path, delimiter=', ', comments='#', idx_mapping=None)[source]

Reads node labels from a file and returns them as a numpy ndarray where the first column contains nodeIDs and the second, node attributes. If idx_mapping is provided, the original indices are re-mapped to new indices.

Parameters:
  • input_path (file or string) – File or filename to read. File is assumed to contain in each line a nodeID and label pair.
  • delimiter (string, optional) – The string used to separate values in the input file. Default is ‘,’.
  • comments (string, optional) – The character used to indicate the start of a comment. Default is ‘#’.
  • idx_mapping (list of tuples) – A list of (OldNodeID, NewNodeID).
Returns:

labels – A column vector of nodeID and label pairs.

Return type:

ndarray

evalne.utils.preprocess.read_node_embeddings(input_path, nodes, embed_dim, delimiter=', ', method_name=None)[source]

Reads node embeddings from a file (see Notes for expected input file structure) and returns the results as a dictionary. Keys correspond to nodeIDs and values to vectors. File headers are inferred base on the expected number of embeddings and ignored.

Parameters:
  • input_path (file or string) – File or filename to read.
  • nodes (array_like) – The network nodes for which embeddings are expected.
  • embed_dim (int) – The expected dimensionality of the node embeddings.
  • delimiter (string, optional) – The string used to separate values in the input file. Default is ‘,’.
  • method_name (string, optional) – A string indicating the name of the method being evaluated. If provided will be used when logging an error. Default is None.
Returns:

X – A dictionary of {nodeID: embed_vect, nodeID: embed_vect, …}. Keys are type string and values array_like.

Return type:

dict

Raises:

ValueError – If the number of embeddings in the file is less than len(nodes). If the input data dimensions are not correct.

Notes

The method accepts files where each line corresponds to the embedding of one node. The nodeID can be present as the first value in each line. If the nodeID is not present, embeddings are assumed to be written in the file in ascending nodeID order.

evalne.utils.preprocess.read_predictions(input_path, ebunch_len, delimiter=', ', method_name=None)[source]

Reads node similarities from a file (see Notes for expected input file structure) and returns the results as a vector. File headers are inferred base on the expected number of predictions and ignored.

Parameters:
  • input_path (file or string) – File or filename to read.
  • ebunch_len (int) – The number of predictions expected.
  • delimiter (string, optional) – The string used to separate values in the input file. Default is ‘,’.
  • method_name (string, optional) – A string indicating the name of the method being evaluated. If provided will be used when logging an error. Default is None.
Returns:

Y – A column vector containing node similarities.

Return type:

ndarray

Raises:

ValueError – If the number of predictions in the file is less than ebunch_len. If the input data dimensions are not correct.

Notes

The method accepts files where all similarities are given in a single line in the file or one per line. In both case, the expected length of these “vectors” is ebunch_len. For input files containing similarities in rows, the method expects the last element of each row to be the similarity. This is useful for methods that return triplets: src, dst, sim.

evalne.utils.preprocess.read_train_test(input_path, split)[source]

Reads the sets of train and test edges and non-edges from the given path that share the given split ID. The method assumes that these files follow the naming convention of store_train_test_splits().

Parameters:
  • input_path (string) – The path where the input data can be found.
  • split (int) – The ID of the splits to be read.
Returns:

  • train_E (set) – Set of train edges
  • train_E_false (set) – Set of train non-edges
  • test_E (set) – Set of test edges
  • test_E_false (set) – Set of test non-edges

See also

evalne.utils.split_train_test.store_train_test_splits()
The files in the provided input path are expected to follow the naming convention of this function.
evalne.utils.preprocess.relabel_nodes(train_E, test_E, directed)[source]

For given sets of train and test edges, the method returns relabeled sets with nodes being integers in 0…N. Additionally, the method returns a graph containing all edges in the train and test and nodes in 0…N.

Parameters:
  • train_E (set) – The set of train edges.
  • test_E (set) – The set of test edges.
  • directed (bool) – Indicates if the method should return a graph or a digraph.
Returns:

  • train_E (set) – The set of train edges
  • test_E (set) – The set of test edges
  • G (graph) – A NetworkX graph or digraph with relabeled nodes containing the edges in train and test.
  • mapping (dict) – A dictionary containing old node id’s as key and new id’s as values.

evalne.utils.preprocess.save_graph(G, output_path, delimiter=', ', write_stats=True, write_weights=False, write_dir=True)[source]

Saves a graph to a file as an edgelist or weighted edgelist. If write_stats is True the file will include a header containing basic graph statistics as provided by the get_stats function.

Parameters:
  • G (graph) – A NetworkX graph or digraph.
  • output_path (file or string) – File or filename to write. If a file is provided, it must be opened in ‘wb’ mode.
  • delimiter (string, optional) – The string to use for separating values in the output file. Default is ‘,’.
  • write_stats (bool, optional) – Adds basic graph statistics to the file as a header or not. Default is True.
  • write_weights (bool, optional) – If True data will be stored as weighted edgelist i.e. triplets (src, dst, weight), otherwise, as regular (src, dst) pairs. For unweighted graphs, setting this parameter to True will add weight 1 to all edges. Default is False.
  • write_dir (bool, optional) – This parameter is only relevant for undirected graphs. If True, it forces the method to write both edge directions in the file i.e. (src, dst) and (dst, src). If False, only one direction is stored. Default is True.

See also

evalne.utils.split_train_test.get_stats()
Function used to compute simple statistics to add as file header.

evalne.utils.split_train_test module

evalne.utils.split_train_test.broder_alg(G, E)[source]

Runs Andrei Broder’s algorithm [1] to select uniformly at random a spanning tree of the input graph. The method works for directed and undirected networks. The edge directions in the resulting spanning tree are taken from the input edgelist (E). For node pairs where both edge directions exist, one is chosen at random.

Parameters:
  • G (graph) – A NetworkX graph or digraph.
  • E (set) – A set of all directed or undirected edges in G.
Returns:

train_E – A set of edges of G that form the random spanning tree.

Return type:

set

See also

evalne.utils.split_train_test.wilson_alg()
A more computationally efficient approach for selecting a spanning tree.

References

[1]A. Broder, “Generating Random Spanning Trees”, Proc. of the 30th Annual Symposium on Foundations of Computer Science, pp. 442–447, 1989.
evalne.utils.split_train_test.check_overlap(filename, num_sets)[source]

Shows the amount of overlap (shared elements) between sets of node pairs with different split IDs. This function assumes the existance of num_sets files sharing the same name and with split IDs starting from 0 to num_sets.

Parameters:
  • filename (string) – Indicates the path and name (without split ID) of the first set. The sets are assumed to have sequential split IDs starting at 0.
  • num_sets (int) – The number of sets for which to check the overlap.

See also

evalne.utils.split_train_test.store_train_test_splits()
The filename is expected to follow the naming convention of this function without “_<split_id>.csv”.

Examples

Checking the overlap of several files in the test folder containing train edges:

>>> check_overlap("./test/tr_E", 2)
Intersection of 2 sets is 10
Union of 2 sets is 15
Jaccard coefficient: 0.666666666667
evalne.utils.split_train_test.compute_splits_parallel(G, output_path, owa=True, train_frac=0.51, num_fe_train=None, num_fe_test=None, num_splits=10)[source]

Computes in parallel the required number of train/test splits of input graph edges. Also generated the same number of train and test non-edge sets. The resulting sets of node pairs are written to different files. The resulting train edge set has the following properties: spans a graph (digraph) with a single connected (weakly connected) component and the same nodes as G.

Parameters:
  • G (graph) – A NetworkX graph or digraph with a single connected (weakly connected) component.
  • output_path (string) – Indicates the path where data will be stored. Can include a name for all splits to share.
  • owa (bool, optional) – Encodes the belief that the network respects or not the open world assumption. Default is True. If owa=True, train non-edges are sampled from the train graph only and can overlap with test edges. If owa=False, train non-edges are sampled from the full graph and cannot overlap with test edges.
  • train_frac (float, optional) – The proportion of train edges w.r.t. the total number of edges in the input graph (range (0.0, 1.0]). Default is 0.51.
  • num_fe_train (int, optional) – The number of train non-edges to generate. Default is None (same number as train edges).
  • num_fe_test (int, optional) – The number of test non-edges to generate. Default is None (same number as test edges).
  • num_splits (int, optional) – The number of train/test splits to generate. Default is 10.

See also

evalne.utils.split_train_test.split_train_test()
This method used to split graph edges in train and test.
evalne.utils.split_train_test.generate_false_edges_owa()
Method used to generate non-edges if owa=True.
evalne.utils.split_train_test.generate_false_edges_cwa()
Method used to generate non-edges if owa=False.
evalne.utils.split_train_test.generate_false_edges_cwa(G, train_E, test_E, num_fe_train=None, num_fe_test=None)[source]

Randomly samples pairs of unconnected nodes or non-edges from the input graph according to the closed world assumption (see Notes). Returns sets of train and test non-edges.

Parameters:
  • G (graph) – A NetworkX graph or digraph.
  • train_E (set) – The set of train edges.
  • test_E (set) – The set of test edges.
  • num_fe_train (int, optional) – The number of train non-edges to generate. Default is None (same number as train edges).
  • num_fe_test (int, optional) – The number of test non-edges to generate. Default is None (same number as test edges).
Returns:

  • train_E_false (set) – The set of train non-edges.
  • test_E_false (set) – The set of test non-edges.

Raises:

ValueError – If the input graph G has more than one (weakly) connected component. If more non-edges than existing in the graph are required.

Notes

The closed world assumption considers that one is certain about some node pairs being non-edges. Therefore, train non-edges cannot overlap with wither train not test edges. This scenario is common for static graphs where information about both the edges (positive interactions) and non-edges (negative interactions) is knows, e.g. protein protein interaction networks.

evalne.utils.split_train_test.generate_false_edges_owa(G, train_E, test_E, num_fe_train=None, num_fe_test=None)[source]

Randomly samples pairs of unconnected nodes or non-edges from the input graph according to the open world assumption (see Notes). Returns sets of train and test non-edges.

Parameters:
  • G (graph) – A NetworkX graph or digraph.
  • train_E (set) – The set of train edges.
  • test_E (set) – The set of test edges.
  • num_fe_train (int, optional) – The number of train non-edges to generate. Default is None (same number as train edges).
  • num_fe_test (int, optional) – The number of test non-edges to generate. Default is None (same number as test edges).
Returns:

  • train_E_false (set) – The set of train non-edges.
  • test_E_false (set) – The set of test non-edges.

Raises:

ValueError – If the input graph G has more than one (weakly) connected component. If more non-edges than existing in the graph are required.

Notes

The open world assumption considers that for generating train non-edges one has access to the train edges only. Therefore, train non-edges could overlap with test edges. This scenario is common for evolving graphs where edges can only appear. In this case, one has access to certain train edges present in the graph at time t. Non-edges are then sampled using this information. At time t+1, the newly arrived test edges may have been previously selected as train non-edges. For undirected graphs the output is sorted (smallNodeID, bigNodeID).

evalne.utils.split_train_test.naive_split_train_test(G, train_frac=0.51)[source]

Splits the edges of the input graph in sets of train and test and returns the results. Split is performed using the naive split approach (see Notes). The resulting train edge set has the following properties: spans a graph (digraph) with a single connected (weakly connected) component and the same nodes as G.

Parameters:
  • G (graph) – A NetworkX graph or digraph with a single connected (weakly connected) component.
  • train_frac (float, optional) – The proportion of train edges w.r.t. the total number of edges in the input graph (range (0.0, 1.0]). Default is 0.51.
Returns:

  • train_E (set) – The set of train edges.
  • test_E (set) – The set of test edges.

Raises:

ValueError – If the train_frac parameter is not in range (0, 1]. If the input graph G has more than one (weakly) connected component.

Notes

The method proceeds as follows: (1) compute the number of test edges needed from train_frac and size of G. (2) randomly remove edges from the input graph one at a time. If removing an edge causes the graph to become disconnected, add it back, otherwise, put it in the set of test edges. (3) repeat the previous step until all test edges are selected. (4) the remaining edges constitute the train set.

evalne.utils.split_train_test.quick_nonedges(G, train_frac=0.51, fe_ratio=1.0)[source]

Randomly samples pairs of unconnected nodes or non-edges from the input graph according to the open world assumption. Is a more efficient implementation of generate_false_edges_owa which also does not require the train and test edge sets as input. Returns sets of train and test non-edges.

Parameters:
  • G (graph) – A NetworkX graph or digraph.
  • train_frac (float, optional) – The proportion of train edges w.r.t. the total number of edges in the input graph (range (0.0, 1.0]). Default is 0.51.
  • fe_ratio (float, optional) – The ratio of non-edges to edges to sample. For fr_ratio > 0 and < 1 less non-edges than edges will be generated. For fe_edges > 1 more non-edges than edges will be generated. Default 1, same amounts.
Returns:

  • train_E_false (ndarray) – Column vector of train non-edges as pairs src, dst.
  • test_E_false (ndarray) – Column vector of test non-edges as pairs src, dst.

Raises:

ValueError – If more non-edges than existing in the graph are required.

evalne.utils.split_train_test.quick_split(G, train_frac=0.51)[source]

Splits the edges of the input graph in sets of train and test and returns the results. Split is performed using the quick split approach (see Notes). The resulting train edge set has the following properties: spans a graph (digraph) with a single connected (weakly connected) component and the same nodes as G.

Parameters:
  • G (graph) – A NetworkX graph or digraph with a single connected (weakly connected) component.
  • train_frac (float, optional) – The proportion of train edges w.r.t. the total number of edges in the input graph (range (0.0, 1.0]). Default is 0.51.
Returns:

  • train_E (ndarray) – Column vector of train edges as pairs src, dst.
  • test_E (ndarray) – Column vector of test edges as pairs src, dst.

Raises:

ValueError – If the train_frac parameter is not in range (0, 1]. If the input graph G has more than one (weakly) connected component.

Notes

The method proceeds as follows: (1) a spanning tree of the input graph is generated using a depth first tree approach starting at a random node, (2) randomly selected edges are added to those of the spanning tree until train_frac is reached, (3) the remaining edges, not used in previous steps, form the test set.

evalne.utils.split_train_test.rand_split_train_test(G, train_frac=0.51)[source]

Splits the edges of the input graph in sets of train and test and returns the results. Split is performed using the random split approach (see Notes). The resulting train edge set has the following properties: spans a graph (digraph) with a single connected (weakly connected) component.

Parameters:
  • G (graph) – A NetworkX graph or digraph.
  • train_frac (float, optional) – The proportion of train edges w.r.t. the total number of edges in the input graph (range (0.0, 1.0]). Default is 0.51.
Returns:

  • train_E (set) – The set of train edges.
  • test_E (set) – The set of test edges.

Raises:

ValueError – If the train_frac parameter is not in range (0, 1].

Notes

The method proceeds as follows: (1) randomly remove 1-train_frac percent of edges from the input graph. (2) from the remaining edges compute the main connected component and these will be the train edges. (3) from the set of removed edges, those such that both end nodes exist in the train edge set computed in the previous step, are added to the final test set.

evalne.utils.split_train_test.random_edge_sample(a, samp_frac=0.01, directed=False, sample_diag=False)[source]

Samples uniformly at random positions from the input adjacency matrix. The samples are returned in two sets, one containing all edges sampled and one containing all non-edges.

Parameters:
  • a (sparse matrix) – A sparse adjacency matrix.
  • samp_frac (float, optional) – A float in range [0,1] representing the fraction of all edge pairs to sample. Default is 0.01 (1%)
  • directed (bool, optional) – A flag indicating if the adjacency matrix should be considered directed or undirected. If undirected, indices are obtained only from the lower triangle. Default is False.
  • sample_diag (bool, optional) – If True elements from the main diagonal are also sampled (i.e. self loops). Else they are excluded from the sampling process. Default is False.
Returns:

  • pos_e (ndarray) – Sampled node pairs that are edges.
  • neg_e (ndarray) – Sampled node pairs that are non-edges.

evalne.utils.split_train_test.redges_false(train_E, test_E, output_path=None)[source]

For directed graphs computes all non-edges (a->b) such that the opposite edge (a<-b) exists in the graph. It does this for both the train and test edge sets.

Parameters:
  • train_E (set) – The set of train edges.
  • test_E (set) – The set of test edges.
  • output_path (string, optional) – A path or file where to store the results. If None, results are not stored only returned. Default is None.
Returns:

  • train_redges_false (set) – A set of train edges respecting the mentioned property.
  • test_redges_false (set) – A set of test edges respecting the mentioned property.

Notes

These non-edges can be used to asses the performance of the embedding methods on predicting non-reciprocated edges.

evalne.utils.split_train_test.split_train_test(G, train_frac=0.51, st_alg='wilson')[source]

Splits the edges of the input graph in sets of train and test and returns the results. Split is performed using the spanning tree approach (see Notes). The resulting train edge set has the following properties: spans a graph (digraph) with a single connected (weakly connected) component and the same nodes as G.

Parameters:
  • G (graph) – A NetworkX graph or digraph with a single connected (weakly connected) component.
  • train_frac (float, optional) – The proportion of train edges w.r.t. the total number of edges in the input graph (range (0.0, 1.0]). Default is 0.51.
  • st_alg (string, optional) – The algorithm to use for generating the spanning tree constituting the backbone of the train set. Options are: ‘wilson’ and ‘broder’. The first option, ‘wilson’ is faster in most cases. Default is ‘wilson’.
Returns:

  • train_E (set) – The set of train edges.
  • test_E (set) – The set of test edges.

Raises:

ValueError – If the train_frac parameter is not in range (0, 1]. If the input graph G has more than one (weakly) connected component.

Notes

The method proceeds as follows: (1) a spanning tree of the input graph is selected uniformly at random using broder or wilson’s algorithm, (2) randomly selected edges are added to those of the spanning tree until train_frac is reached, (3) the remaining edges, not used in previous steps, form the test set.

evalne.utils.split_train_test.store_edgelists(train_path, test_path, train_edges, test_edges)[source]

Writes the train and test edgelists to files with the specified names.

Parameters:
  • train_path (string) – Indicates the path where the train data will be stored.
  • test_path (string) – Indicates the path where the test data will be stored.
  • train_edges (array_like) – Set of train edges.
  • test_edges (array_like) – Set of test edges.
evalne.utils.split_train_test.store_train_test_splits(output_path, train_E, train_E_false, test_E, test_E_false, split_id=0)[source]

Writes the sets of train and test edges and non-edges to separate files in the provided path. All files will share the same split number as an identifier. If any folder in the path does not exist, it will be generated.

Parameters:
  • output_path (string) – Indicates the path where data will be stored.
  • train_E (set) – Set of train edges.
  • train_E_false (set) – Set of train non-edges.
  • test_E (set) – Set of test edges.
  • test_E_false (set) – Set of test non-edges.
  • split_id (int, optional) – The ID of train/test split to be stored. Default is 0.
Returns:

filenames – A list of strings, the names and paths of the 4 files where the train and test edges and non-edges are stored.

Return type:

list

See also

evalne.utils.preprocess.read_train_test()
A function that can read the generated files.

Notes

This function generates 4 files under <output_path> named: ‘trE_<split_id>.csv’, ‘negTrE_<split_id>.csv’, ‘teE_<split_id>.csv’, ‘negTeE_<split_id>.csv’ corresponding respectively to train edges train non-edges, test edges and test non-edges.

Examples

Writes some data to files under a folder named test:

>>> train_E = ((1,2), (2,3))
>>> train_E_false = ((-1,-2), (-2,-3))
>>> test_E = ((10,20), (20,30))
>>> test_E_false = ((-10,-20), (-20,-30))
>>> store_train_test_splits("./test", train_E, train_E_false, test_E, test_E_false, split_id=0)
('./test/trE_0.csv', './test/negTrE_0.csv', './test/teE_0.csv', './test/negTeE_0.csv')
evalne.utils.split_train_test.timestamp_split(G, train_frac=0.51)[source]

Splits the edges of the input graph in sets of train and test and returns the results. Split is performed using edge timestamps (see Notes). The resulting train edge set has the following properties: spans a graph (digraph) with a single connected (weakly connected) component.

Parameters:
  • G (graph) – A NetworkX graph or digraph where edge weights are timestamps.
  • train_frac (float, optional) – The proportion of train edges w.r.t. the total number of edges in the input graph (range (0.0, 1.0]). Default is 0.51.
Returns:

  • train_E (ndarray) – Column vector of train edges as pairs src, dst.
  • test_E (ndarray) – Column vector of test edges as pairs src, dst.
  • tg (graph) – A NetworkX graph containing only the edges in the train edge set.

Raises:

ValueError – If the train_frac parameter is not in range (0, 1]. If the input graph G has more than one (weakly) connected component.

Notes

The method proceeds as follows: (1) sort all edges by timestamp. (2) randomly remove 1-train_frac percent of edges from the input graph. (3) from the remaining edges compute the main connected component and these will be the train edges. (4) from the set of removed edges, those such that both end nodes exist in the train edge set computed in the previous step, are added to the final test set.

evalne.utils.split_train_test.wilson_alg(G, E)[source]

Runs Willson’s algorithm [2] also known as loop erasing random walk to select uniformly at random a spanning tree of the input graph. The method works for directed and undirected networks [3]. The edge directions in the resulting spanning tree are taken from the input edgelist (E). For node pairs where both edge directions exist, one is chosen at random.

Parameters:
  • G (graph) – A NetworkX graph or digraph.
  • E (set) – A set of all directed or undirected edges in G.
Returns:

train_E – A set of edges of G that form the random spanning tree.

Return type:

set

See also

evalne.utils.split_train_test.broder_alg()
A different approach for selecting a spanning tree that could be faster for specific graphs.

References

[2]D. B. Wilson, “Generating Random Spanning Trees More Quickly than the Cover Time”, In Proceedings of STOC, pp. 296–303, 1996.
[3]J. G. Propp and D. B. Wilson, “How to Get a Perfectly Random Sample from a Generic Markov Chain and Generate a Random Spanning Tree of a Directed Graph”, Journal of Algorithms 27, pp. 170–217, 1998.

evalne.utils.util module

exception evalne.utils.util.TimeoutExpired[source]

Bases: Exception

evalne.utils.util.auto_import(classpath)[source]

Imports any Sklearn binary classifier from a string.

Parameters:classpath (string) – A string indicating the full path the any Sklearn classifier and its parameters.
Returns:clf – The classifier instance.
Return type:object
Raises:ValueError – If the classifier could not be imported.

Examples

Importing the SVC classifier with user-defined parameters:

>>> auto_import("sklearn.svm.SVC(C=1.0, kernel='rbf')")
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

Importing a decision tree classifier with no parameters:

>>> auto_import("sklearn.ensemble.ExtraTreesClassifier()")
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
    max_depth=None, max_features='auto', max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
    oob_score=False, random_state=None, verbose=0, warm_start=False)
evalne.utils.util.run(cmd, timeout, verbose)[source]

Runs the cmd command provided as input in a new process. If execution time exceeds timeout, the process is killed and a TimeoutExpired exception is raised.

Parameters:
  • cmd (string) – A string indicating the command to run on the command line.
  • timeout (int or float) – A value indicating the maximum number of second the process is allowed to run for.
  • verbose (bool) – Boolean indicating if the execution output should be shown or not (pipes stdout and stderr to devnull).
Raises:

TimeoutExpired – If the execution time exceeds the number of second indicated by timeout.

Notes

The method additionally raises ImportError, IOError and AttributeError if these are encountered during execution of the cmd command.

Examples

Runs a command that prints Start, sleeps for 5 seconds and prints Done

>>> util.run("python -c 'import time; print("Start"); time.sleep(5); print("Done")'", 7, True)
Start
Done

Same as previous command but now it does not print Done because it times out after 2 seconds

>>> util.run("python -c 'import time; print("Start"); time.sleep(5); print("Done")'", 2, True)
Start
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "EvalNE/evalne/utils/util.py", line 84, in run
    A string indicating the command to run on the command line.
TimeoutExpired: Command `python -c 'import time; print("Start"); time.sleep(5); print("Done")'` timed out
evalne.utils.util.run_function(timeout, func, *args)[source]

Runs the function provided as input in a new process. If execution time exceeds timeout, the process is killed and a TimeoutExpired exception is raised.

Parameters:
  • timeout (int or float) – A value indicating the maximum number of seconds the process is allowed to run for or None.
  • func (object) – A function to be executed. First parameter must be a queue object. Check notes section for more details.
  • *args – Variable length argument list for function func. The list shloud not include the queue object.
Raises:

TimeoutExpired – If the execution time exceeds the number of second indicated by timeout.

Notes

The input function func must take as a first parameter a queue object q. This is used to communicate results between the process where the function is running and the main thread. The list of args does not need to include the queue object, it is automatically inserted by this function.

Examples

Runs function foo for at most 100 seconds and returns the result. Foo must put the result in q.

>>> def foo(q, a, b):
...     q.put(a+b)
...
>>> run_function(100, foo, *[1, 2])
3

evalne.utils.viz_utils module

evalne.utils.viz_utils.parallel_coord(scoresheet, features, class_col='methods')[source]

Generates a parallel coordinate plot from the given Scoresheet object and the set of features specified.

Parameters:
  • scoresheet (evalne.Scoresheet) – A Scoresheet object containing the results of an evaluation.
  • features (list) – A list of strings indicating the features to show in the plot (in addition to methods and networks). Accepted features are: ‘auroc’, ‘average_precision’, ‘precision’, ‘recall’, ‘fallout’, ‘miss’, ‘accuracy’, ‘f_score’, eval_time and edge_embed_method.
  • class_col (string, optional) – Indicates the class to highlight. Options are methods and networks. Default is methods.
evalne.utils.viz_utils.plot_curve(filename, x, y, x_label, y_label, title=None)[source]

Plots y coordinates against x coordinates as a line.

Parameters:
  • filename (string) – A file or filename where to store the plot.
  • x (array_like) – The x coordinates of the plot.
  • y (array_like) – The y coordinates of the plot.
  • x_label (string) – The label of the x axis.
  • y_label (string) – The label of the y axis.
  • title (string or None, optional) – The title of the plot. Default is None (no title).
evalne.utils.viz_utils.plot_emb2d(emb, colors=None, filename=None)[source]

Generates a scatter plot of the given embeddings. Optional colors for the nodes can be provided as well as a filename to store the results. If dim of embeddings > 2, uses t-SNE to reduce it to 2.

Parameters:
  • emb (matrix) – A Numpy matrix containing the node or edge embeddings.
  • colors (array, optional) – A Numpy array containing the colors of each node. Default is None.
  • filename (string, optional) – A string indicating the path and name of the file where to store the scatter plot. If not provided the plot is shown on screen. Default is None.
evalne.utils.viz_utils.plot_graph2d(G, emb=None, labels=None, colors=None, filename=None)[source]

Plots the given graph in 2d. If the embeddings of nodes are provided, they are used to place the nodes on the 2d plane. If dim of embeddings > 2, then its reduced to 2 using t-SNE. Optional labels and colors for the nodes can be provided, as well as a filename to store the results.

Parameters:
  • G (graph) – A NetworkX graph or digraph.
  • emb (matrix, optional) – A Numpy matrix containing the node embeddings. Default is None.
  • labels (dict, optional) – A dictionary containing nodeIDs as keys and node labels as values. Default is None.
  • colors (array, optional) – A Numpy array containing the colors of each graph node. Default is None.
  • filename (string, optional) – A string indicating the path and name of the file where to store the scatter plot. If not provided the plot is showed on screen. Default is None.

Module contents