splearn.datasets package

Submodules

splearn.datasets.base module

splearn.datasets.base.load_data_sample(adr, filetype='SPiCe', pickle=False)

Load a sample from a file and return the corresponding DataSample, a dictionary-like object built from the (word, count) data.

  • Input:
Parameters:
  • adr (str) – address and name of the loaded file
  • filetype (str) – (default value = ‘SPiCe’) indicates the structure of the file. Should be either ‘SPiCe’ or ‘Pautomac’
  • pickle (boolean) – if enabled, a pickle file is created from the loaded file. Default is False.
  • Output:
Returns: corresponding DataSample
Return type: DataSample
Example:
>>> from splearn.datasets.base import load_data_sample
>>> from splearn.tests.datasets.get_dataset_path import get_dataset_path
>>> train_file = '3.pautomac_light.train' # '4.spice.train'
>>> data = load_data_sample(adr=get_dataset_path(train_file))
>>> data.nbL
4
>>> data.nbEx
5000
>>> data.data
Splearn_array([[ 3.,  0.,  3., ..., -1., -1., -1.],
       [ 3.,  3., -1., ..., -1., -1., -1.],
       [ 3.,  2.,  0., ..., -1., -1., -1.],
       ...,
       [ 3.,  1.,  3., ..., -1., -1., -1.],
       [ 3.,  0.,  3., ..., -1., -1., -1.],
       [ 3.,  3.,  1., ..., -1., -1., -1.]])

splearn.datasets.data_sample module

This module contains the DataSample class and SplearnArray class.

class splearn.datasets.data_sample.DataSample(data=None, **kwargs)

Bases: dict

A DataSample instance

  • Input:
Parameters: data (tuple) – a tuple (nbL, nbEx, data) of types (int, int, numpy.array), where nbL is the number of letters in the alphabet, nbEx is the number of samples, and data is the 2D data array
Example:
>>> from splearn.datasets.base import load_data_sample
>>> from splearn.tests.datasets.get_dataset_path import get_dataset_path
>>> train_file = '3.pautomac_light.train' # '4.spice.train'
>>> data = load_data_sample(adr=get_dataset_path(train_file))
>>> print(data.__class__)
<class 'splearn.datasets.data_sample.DataSample'>
>>> data.nbL
4
>>> data.nbEx
5000
>>> data.data

data

SplearnArray

nbEx

Number of examples

nbL

Number of letters

class splearn.datasets.data_sample.SplearnArray

Bases: numpy.ndarray

Sample data array used by the splearn spectral estimation

The SplearnArray class inherits from numpy.ndarray and stores the sample as a 2D array.

Example of a possible 2d shape:

0 1 0 3 -1
0 0 3 3 1
1 1 -1 -1 -1
5 -1 -1 -1 -1
-1 -1 -1 -1 -1

is equivalent to:

  • word (0103) or abad
  • word (00331) or aaddb
  • word (11) or bb
  • word (5) or f
  • word () or empty
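
The correspondence above can be sketched with plain NumPy. This is only an illustration of the encoding, not the library's own code: the helper names rows_to_words and word_to_string are hypothetical, and -1 is treated as end-of-word padding.

```python
import numpy as np

def rows_to_words(array):
    """Convert a 2D array padded with -1 into a list of words (tuples of ints).

    Hypothetical helper, not part of splearn.
    """
    words = []
    for row in np.asarray(array):
        word = []
        for letter in row:
            if letter == -1:  # -1 marks the end of the word
                break
            word.append(int(letter))
        words.append(tuple(word))
    return words

def word_to_string(word):
    """Map integer letters to characters (0 -> a, 1 -> b, 2 -> c, ...)."""
    return "".join(chr(ord("a") + letter) for letter in word)

# The 2D example shown above
sample = np.array([
    [0, 1, 0, 3, -1],
    [0, 0, 3, 3, 1],
    [1, 1, -1, -1, -1],
    [5, -1, -1, -1, -1],
    [-1, -1, -1, -1, -1],
])

decoded = [word_to_string(w) for w in rows_to_words(sample)]
print(decoded)  # ['abad', 'aaddb', 'bb', 'f', '']
```

The last row contains only -1 values, so it decodes to the empty word.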

Each row represents one word of the sample. Words are sequences of integer letters (0->a, 1->b, 2->c …), and -1 marks the end of the word. The number of rows is the total number of words in the sample (=nbEx), and the number of columns is the length of the longest word. Note that duplicates are not merged: a word that appears twice in the sample occupies two rows and is counted as two different examples.
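
As a minimal sketch of this layout, the snippet below pads a list of words with -1 into a 2D array. The function words_to_array is a hypothetical helper, not a splearn function, and the float dtype is an assumption chosen to match the float values printed in the examples on this page.

```python
import numpy as np

def words_to_array(words):
    """Pad a list of words (sequences of ints) with -1 into a 2D array.

    Hypothetical helper, not part of splearn.
    """
    # One column per letter of the longest word
    n_cols = max((len(word) for word in words), default=0)
    # One row per word; every cell starts as the end-of-word marker
    array = np.full((len(words), n_cols), -1.0)
    for i, word in enumerate(words):
        if word:  # the empty word keeps a row filled with -1
            array[i, :len(word)] = word
    return array

# (1, 1) appears twice and therefore occupies two rows
words = [(0, 1, 0, 3), (0, 0, 3, 3, 1), (1, 1), (1, 1), ()]
array = words_to_array(words)
nb_ex = array.shape[0]
print(array.shape)  # (5, 5)
```

Here nb_ex is 5 although the sample contains only four distinct words, which matches the counting rule described above.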

The DataSample class also encapsulates the sample’s parameters ‘nbL’ and ‘nbEx’ (the number of letters in the alphabet and the number of samples) and the four dictionaries ‘sample’, ‘prefix’, ‘suffix’ and ‘factor’, which are populated during fit estimation.

  • Input:
Parameters:
  • input_array (numpy.ndarray) – input ndarray that will be converted into SplearnArray
  • nbL (int) – the number of letters
  • nbEx (int) – total number of examples.
  • sample (dict) – the keys are the words and the values are the number of times each word appears in the sample.
  • pref (dict) – the keys are the prefixes and the values are the number of times each prefix appears in the sample.
  • suff (dict) – the keys are the suffixes and the values are the number of times each suffix appears in the sample.
  • fact (dict) – the keys are the factors and the values are the number of times each factor appears in the sample.
Example:
>>> from splearn.datasets.base import load_data_sample
>>> from splearn.tests.datasets.get_dataset_path import get_dataset_path
>>> train_file = '3.pautomac_light.train' # '4.spice.train'
>>> data = load_data_sample(adr=get_dataset_path(train_file))
>>> print(data.__class__)
<class 'splearn.datasets.data_sample.DataSample'>
>>> data.data
SplearnArray([[ 3.,  0.,  3., ..., -1., -1., -1.],
    [ 3.,  3., -1., ..., -1., -1., -1.],
    [ 3.,  2.,  0., ..., -1., -1., -1.],
    ...,
    [ 3.,  1.,  3., ..., -1., -1., -1.],
    [ 3.,  0.,  3., ..., -1., -1., -1.],
    [ 3.,  3.,  1., ..., -1., -1., -1.]])
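
The role of the sample, pref, suff and fact dictionaries listed in the parameters can be illustrated with a small pure-Python sketch. It is not the library's implementation: count_dictionaries is a hypothetical helper, the keys are assumed to be tuples of integer letters, and the counting convention (one count per occurrence position, empty word included) is an assumption that may differ from what splearn does during fitting.

```python
from collections import Counter

def count_dictionaries(words):
    """Count words, prefixes, suffixes and factors of a sample.

    Hypothetical helper, not part of splearn. Every occurrence is
    counted, so a duplicated word contributes twice.
    """
    sample, pref, suff, fact = Counter(), Counter(), Counter(), Counter()
    for word in words:
        word = tuple(word)
        n = len(word)
        sample[word] += 1
        for i in range(n + 1):
            pref[word[:i]] += 1  # prefix of length i
            suff[word[i:]] += 1  # suffix starting at position i
            for j in range(i, n + 1):
                fact[word[i:j]] += 1  # contiguous sub-word word[i:j]
    return dict(sample), dict(pref), dict(suff), dict(fact)

words = [(0, 1), (0, 1), (1,)]
sample, pref, suff, fact = count_dictionaries(words)
print(sample)  # {(0, 1): 2, (1,): 1}
```

With this convention the empty prefix is counted once per word, so pref[()] equals the number of examples.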

Module contents