scan module

scan.create_search_page(directory, output_file='search.html', false_positive=0.1, chunk_size=4, remove_stopwords=True)

Generates the search output file from the HTML files in the given directory.

Parameters

directory – Directory path where HTML files are located
output_file – name of the output file (Default - “search.html”)
false_positive – Acceptable false positive rate during search (Default - 0.1). A rate of 0.01 gives fewer false matches, at the cost of a larger output file.
chunk_size – Size of each counter in the Spectral Bloom Filter (Default - 4). With the default of 4, a counter can take 2**4 = 16 distinct values.
remove_stopwords – Whether to remove stopwords (Default - True)

The search file is saved at the output_file path.
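To see why a lower false_positive value increases the file size, the sketch below applies the textbook Bloom filter sizing formulas. It is only an illustration: the helper name bloom_parameters and the item count of 1000 are hypothetical, and scan may size its filters differently.

```python
import math
from typing import Tuple


def bloom_parameters(n_items: int, false_positive: float) -> Tuple[int, int]:
    """Textbook Bloom filter sizing (hypothetical helper, not part of scan)."""
    # Number of slots: m = -n * ln(p) / (ln 2)^2
    m = math.ceil(-n_items * math.log(false_positive) / (math.log(2) ** 2))
    # Number of hash functions: k = (m / n) * ln 2, with at least one
    k = max(1, round(m / n_items * math.log(2)))
    return m, k


# Compare the default rate with the stricter alternative for 1000 distinct words.
for p in (0.1, 0.01):
    m, k = bloom_parameters(1000, p)
    print(f"false_positive={p}: m={m}, k={k}")
```

Under these formulas, moving from 0.1 to 0.01 roughly doubles the number of slots, and each slot would additionally occupy chunk_size bits if counters are stored that way.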

scan.download_urls(json_file, output_file='')

Downloads and saves HTML files using a JSON file containing a list of URLs. (For debugging purposes)

scan.generate_bloom_filter(file, false_positive=0.1, chunk_size=4, remove_stopwords=True)

Generates a Bloom filter and saves it in a .bin file.

The .bin file has the same name as the .html file.

Returns a dictionary containing the length of the bitarray (m), the number of hash functions used (k), the chunk size (chunk_size), the binary file name (bin_file), and the HTML file’s title (title).

This function is used internally by create_search_page.
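How m, k, and chunk_size fit together can be sketched with a small counting filter. This is not the library's implementation: the class TinyCountingFilter, the salted SHA-256 hashing scheme, and the reading of chunk_size as the counter width in bits (so a counter stores 0 to 2**chunk_size - 1) are all assumptions made for illustration.

```python
import hashlib


class TinyCountingFilter:
    """Illustrative counting Bloom filter (hypothetical, not scan's code)."""

    def __init__(self, m: int, k: int, chunk_size: int = 4):
        self.m = m  # number of counters
        self.k = k  # number of hash functions
        # Assumption: chunk_size is the counter width in bits.
        self.max_count = 2 ** chunk_size - 1
        self.counters = [0] * m

    def _positions(self, word: str):
        # Derive k positions by salting one hash with the index i (assumed scheme).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            # Saturate at the maximum instead of overflowing.
            self.counters[pos] = min(self.counters[pos] + 1, self.max_count)

    def count(self, word: str) -> int:
        # The minimum over the k counters estimates the word's frequency.
        return min(self.counters[pos] for pos in self._positions(word))


f = TinyCountingFilter(m=4793, k=3, chunk_size=4)
for _ in range(20):
    f.add("bloom")
f.add("filter")
print(f.count("bloom"), f.count("filter"), f.count("absent"))
```

In this sketch a word added 20 times is reported at the saturated counter value, which shows the trade-off behind chunk_size: wider counters track higher frequencies but make the saved file larger.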

scan.get_all_bin_files(directory)

Returns a list of the .bin files located in the directory.

scan.get_all_html_files(directory)

Returns a list of the HTML files located in the directory.
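The behaviour of the two listing functions can be approximated with the standard library. The helper list_files_with_suffix below is hypothetical, and it assumes a non-recursive scan that returns full paths; the actual functions may differ on both points.

```python
import tempfile
from pathlib import Path


def list_files_with_suffix(directory: str, suffix: str) -> list:
    """Hypothetical helper: sorted paths of files in `directory` ending in `suffix`."""
    return sorted(
        str(p)
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == suffix
    )


# Build a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.html", "b.html", "a.bin", "notes.txt"):
        (Path(tmp) / name).write_text("")
    html_files = list_files_with_suffix(tmp, ".html")
    bin_files = list_files_with_suffix(tmp, ".bin")
    print([Path(p).name for p in html_files])  # ['a.html', 'b.html']
    print([Path(p).name for p in bin_files])   # ['a.bin']
```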
