spectral_bloom_filter module¶

class spectral_bloom_filter.Hash_Funcs(k: int, m: int)¶

Bases: object

static check_duplicates(indices_list: list)¶

check_hashes(word_list: list)¶

Logs the duplicate hashed indices for words in words_list

Parameters: word_list – List of words

get_hashes(word: str) → list¶

Returns a list of k hashed indices for the input word

Parameters: word – Word to be hashed
Returns: List of hashes of the word

class spectral_bloom_filter.Spectral_Bloom_Filter(error_rate: float = 0.01)¶

Bases: object

Creates a Spectral Bloom Filter using the words parsed from the documents

Paper: SIGMOD ‘03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 2003 Pages 241–252

DOI: https://doi.org/10.1145/872757.872787

create_filter(tokens: list, m: int, chunk_size: int = 4, no_hashes: int = 5, method: str = 'minimum', to_bitarray: bool = True, bitarray_path: str = 'document.bin') → bitarray.bitarray¶

Creates a spectral bloom filter.

Paper: SIGMOD ‘03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 2003 Pages 241–252

DOI: https://doi.org/10.1145/872757.872787

Parameters

tokens – List of words to index in spectral bloom filter
m – size of the bitarray
chunk_size – Size of each counter in Spectral Bloom Filter (default: 4). Default of 4 means that the maximum increment a counter. Can perform is 2**4, which is 16.
no_hashes – No. of hashes to index word with, (default: 5)
method – Currently only “minimum” is supported, (default: “minimum”). “minimum” stands for Minimum Increment
to_bitarray – If True, will convert and save as bitarray in bitarray_path. If False, method will return list of lists containing the entire bitarray with chunks. (Default: True).
bitarray_path – Path to store the bitarray, (default:”document.bin”).

create_hashes(token: str, hashes: int, max_length: int) → list¶

Get the hased indices for the string

Parameters

token – token to index
hashes – no. of hashes (k)
max_length – maximum length of the hash (m)

Returns

list of hashes

gen_counter_chunks(string: str, chunk_size: int, drop_remaining: bool = False) → Iterable[str]¶

Yields an iterator of chunks of specified size

If drop_remaining is specified, the iterator is guaranteed to have all chunks of same size.

>>> list(gen_counter_chunks('123456789A', 4)) == ['1234', '5678', '9A']
>>> list(gen_counter_chunks('123456789A', 4, drop_remaining = True)) == ['1234', '5678']

Parameters

string – bit string whose chunks are to be obtained
chunk_size – size of each chunk (optimal: 4)
drop_remaining – to drop the extra string, if left, (default: False)

Returns

generator object containing the list of chunks

init_counter(counter_length: int) → dict¶

To initialize a binary counter for incrementing Spectral Bloom Filter’s counters.

Example: For counter_length = 2 Method returns - {‘00’: ‘01’, ‘01’: ‘10’, ‘10’: ‘11’, ‘11’: ‘11’}

Parameters: counter_length – No. of bits in each counter
Returns: Dictionary used for binary counter operation

initialize_string(length: int)¶

Returns string of zeros of width “length”.

Parameters: length – size of the string
Returns: string of 0s of the specified length

optimal_m_k(n: int, p: int) → tuple¶

From: https://stackoverflow.com/questions/658439/how-many-hash-functions-does-my-bloom-filter-need

Parameters

n – items expected in filter
p – false positive rate
chunk_size – number of bits in each counter

Returns

Tuple containing: m for number of bits needed in the bloom filter (index 0) and k for number of hash functions we should apply (index 1)

spectral_bloom_filter module¶

Sthir

Navigation

Related Topics