spectral_bloom_filter module

class spectral_bloom_filter.Hash_Funcs(k: int, m: int)

Bases: object

static check_duplicates(indices_list: list)
check_hashes(word_list: list)

Logs the duplicate hashed indices for words in words_list

Parameters

word_list – List of words

get_hashes(word: str) → list

Returns a list of k hashed indices for the input word

Parameters

word – Word to be hashed

Returns

List of hashes of the word

class spectral_bloom_filter.Spectral_Bloom_Filter(error_rate: float = 0.01)

Bases: object

Creates a Spectral Bloom Filter using the words parsed from the documents

Paper: SIGMOD ‘03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 2003 Pages 241–252
create_filter(tokens: list, m: int, chunk_size: int = 4, no_hashes: int = 5, method: str = 'minimum', to_bitarray: bool = True, bitarray_path: str = 'document.bin') → bitarray.bitarray

Creates a spectral bloom filter.

Paper: SIGMOD ‘03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 2003 Pages 241–252
Parameters
  • tokens – List of words to index in spectral bloom filter

  • m – size of the bitarray

  • chunk_size – Size of each counter in Spectral Bloom Filter (default: 4). Default of 4 means that the maximum increment a counter. Can perform is 2**4, which is 16.

  • no_hashes – No. of hashes to index word with, (default: 5)

  • method – Currently only “minimum” is supported, (default: “minimum”). “minimum” stands for Minimum Increment

  • to_bitarray – If True, will convert and save as bitarray in bitarray_path. If False, method will return list of lists containing the entire bitarray with chunks. (Default: True).

  • bitarray_path – Path to store the bitarray, (default:”document.bin”).

create_hashes(token: str, hashes: int, max_length: int) → list

Get the hased indices for the string

Parameters
  • token – token to index

  • hashes – no. of hashes (k)

  • max_length – maximum length of the hash (m)

Returns

list of hashes

gen_counter_chunks(string: str, chunk_size: int, drop_remaining: bool = False) → Iterable[str]

Yields an iterator of chunks of specified size

If drop_remaining is specified, the iterator is guaranteed to have all chunks of same size.

>>> list(gen_counter_chunks('123456789A', 4)) == ['1234', '5678', '9A']
>>> list(gen_counter_chunks('123456789A', 4, drop_remaining = True)) == ['1234', '5678']
Parameters
  • string – bit string whose chunks are to be obtained

  • chunk_size – size of each chunk (optimal: 4)

  • drop_remaining – to drop the extra string, if left, (default: False)

Returns

generator object containing the list of chunks

init_counter(counter_length: int) → dict

To initialize a binary counter for incrementing Spectral Bloom Filter’s counters.

Example: For counter_length = 2 Method returns - {‘00’: ‘01’, ‘01’: ‘10’, ‘10’: ‘11’, ‘11’: ‘11’}

Parameters

counter_length – No. of bits in each counter

Returns

Dictionary used for binary counter operation

initialize_string(length: int)

Returns string of zeros of width “length”.

Parameters

length – size of the string

Returns

string of 0s of the specified length

optimal_m_k(n: int, p: int) → tuple

From: https://stackoverflow.com/questions/658439/how-many-hash-functions-does-my-bloom-filter-need

Parameters
  • n – items expected in filter

  • p – false positive rate

  • chunk_size – number of bits in each counter

Returns

Tuple containing: m for number of bits needed in the bloom filter (index 0) and k for number of hash functions we should apply (index 1)