spectral_bloom_filter module¶
-
class
spectral_bloom_filter.
Hash_Funcs
(k: int, m: int)¶ Bases:
object
-
static
check_duplicates
(indices_list: list)¶
-
check_hashes
(word_list: list)¶ Logs the duplicate hashed indices for words in words_list
- Parameters
word_list – List of words
-
get_hashes
(word: str) → list¶ Returns a list of k hashed indices for the input word
- Parameters
word – Word to be hashed
- Returns
List of hashes of the word
-
static
-
class
spectral_bloom_filter.
Spectral_Bloom_Filter
(error_rate: float = 0.01)¶ Bases:
object
Creates a Spectral Bloom Filter using the words parsed from the documents
Paper: SIGMOD ‘03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 2003 Pages 241–252-
create_filter
(tokens: list, m: int, chunk_size: int = 4, no_hashes: int = 5, method: str = 'minimum', to_bitarray: bool = True, bitarray_path: str = 'document.bin') → bitarray.bitarray¶ Creates a spectral bloom filter.
Paper: SIGMOD ‘03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 2003 Pages 241–252- Parameters
tokens – List of words to index in spectral bloom filter
m – size of the bitarray
chunk_size – Size of each counter in Spectral Bloom Filter (default: 4). Default of 4 means that the maximum increment a counter. Can perform is 2**4, which is 16.
no_hashes – No. of hashes to index word with, (default: 5)
method – Currently only “minimum” is supported, (default: “minimum”). “minimum” stands for Minimum Increment
to_bitarray – If True, will convert and save as bitarray in bitarray_path. If False, method will return list of lists containing the entire bitarray with chunks. (Default: True).
bitarray_path – Path to store the bitarray, (default:”document.bin”).
-
create_hashes
(token: str, hashes: int, max_length: int) → list¶ Get the hased indices for the string
- Parameters
token – token to index
hashes – no. of hashes (k)
max_length – maximum length of the hash (m)
- Returns
list of hashes
-
gen_counter_chunks
(string: str, chunk_size: int, drop_remaining: bool = False) → Iterable[str]¶ Yields an iterator of chunks of specified size
If drop_remaining is specified, the iterator is guaranteed to have all chunks of same size.
>>> list(gen_counter_chunks('123456789A', 4)) == ['1234', '5678', '9A'] >>> list(gen_counter_chunks('123456789A', 4, drop_remaining = True)) == ['1234', '5678']
- Parameters
string – bit string whose chunks are to be obtained
chunk_size – size of each chunk (optimal: 4)
drop_remaining – to drop the extra string, if left, (default: False)
- Returns
generator object containing the list of chunks
-
init_counter
(counter_length: int) → dict¶ To initialize a binary counter for incrementing Spectral Bloom Filter’s counters.
Example: For counter_length = 2 Method returns - {‘00’: ‘01’, ‘01’: ‘10’, ‘10’: ‘11’, ‘11’: ‘11’}
- Parameters
counter_length – No. of bits in each counter
- Returns
Dictionary used for binary counter operation
-
initialize_string
(length: int)¶ Returns string of zeros of width “length”.
- Parameters
length – size of the string
- Returns
string of 0s of the specified length
-
optimal_m_k
(n: int, p: int) → tuple¶ From: https://stackoverflow.com/questions/658439/how-many-hash-functions-does-my-bloom-filter-need
- Parameters
n – items expected in filter
p – false positive rate
chunk_size – number of bits in each counter
- Returns
Tuple containing: m for number of bits needed in the bloom filter (index 0) and k for number of hash functions we should apply (index 1)
-