scan module

scan.create_search_page(directory, output_file='search.html', false_positive=0.1, chunk_size=4, remove_stopwords=True)

Generates the search output file from the HTML files in the given directory.

Parameters
  • directory – Directory path where HTML files are located

  • output_file – Name of the output file (Default - “search.html”)

  • false_positive – Acceptable false positive rate during search (Default - 0.1). A value of 0.01 gives more accurate results, at the cost of a larger file size.

  • chunk_size – Size in bits of each counter in the Spectral Bloom Filter (Default - 4). With the default of 4, each counter can take 2**4 = 16 distinct values (0 to 15).

  • remove_stopwords – Whether to remove stopwords (Default - True)

The search file is saved at the output_file path.
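
To make the false_positive trade-off concrete, the sketch below applies the textbook Bloom filter sizing formulas. It is an illustration only: the helper name bloom_parameters and the item count of 1000 are assumptions made for this example, and the scan module may size its filters differently.

```python
import math


def bloom_parameters(n_items, false_positive):
    """Textbook Bloom filter sizing (hypothetical helper, not part of scan).

    Returns (m, k): the number of slots and the number of hash functions
    for n_items distinct words at the requested false positive rate.
    """
    # m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of slots
    m = math.ceil(-n_items * math.log(false_positive) / (math.log(2) ** 2))
    # k = (m / n) * ln 2, rounded to the nearest integer, at least 1
    k = max(1, round(m / n_items * math.log(2)))
    return m, k


# Compare the default rate with the stricter alternative for 1000 words.
for p in (0.1, 0.01):
    m, k = bloom_parameters(1000, p)
    print(f"false_positive={p}: m={m} slots, k={k} hash functions")
```

Under these formulas, lowering the rate from 0.1 to 0.01 roughly doubles the number of slots, which is consistent with the larger file size noted for false_positive.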

scan.download_urls(json_file, output_file='')

Downloads and saves HTML files using a JSON file containing a list of URLs. (For debugging purposes.)

scan.generate_bloom_filter(file, false_positive=0.1, chunk_size=4, remove_stopwords=True)

Generates a Bloom filter for an HTML file and saves it in a .bin file. The .bin file has the same name as the .html file.

Returns a dictionary containing the length of the bitarray (m), the number of hash functions used (k), the chunk size (chunk_size), the binary file name (bin_file), and the HTML file’s title (title).

This method is used internally by create_search_page.
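
As a rough picture of how m, k, and chunk_size fit together, here is a toy counting filter with fixed-width counters. It is a sketch under stated assumptions, not the scan implementation: the class name CountingFilter, the salted SHA-256 hashing, and the saturating behaviour (a counter stops at the largest value its bits can hold, 15 for 4 bits) are all choices made for this example.

```python
import hashlib


class CountingFilter:
    """Toy counting Bloom filter (hypothetical, for illustration only)."""

    def __init__(self, m, k, chunk_size=4):
        self.m = m  # number of counters
        self.k = k  # number of hash functions
        # Largest value a chunk_size-bit counter can hold (15 for 4 bits).
        self.max_count = 2 ** chunk_size - 1
        self.counters = [0] * m

    def _positions(self, word):
        # Derive k positions from salted SHA-256 digests (an assumed scheme).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, word):
        # Increment each of the word's counters, saturating at max_count.
        for pos in self._positions(word):
            if self.counters[pos] < self.max_count:
                self.counters[pos] += 1

    def count(self, word):
        # The minimum over the word's counters estimates its frequency;
        # 0 means the word was definitely never added.
        return min(self.counters[pos] for pos in self._positions(word))


f = CountingFilter(m=4793, k=3, chunk_size=4)
for _ in range(20):
    f.add("bloom")
print(f.count("bloom"))  # capped at 15 by the 4-bit counters
```

A larger chunk_size lets counters track higher word frequencies before saturating, but each counter then takes more bits in the saved file.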

scan.get_all_bin_files(directory)

Returns a list of the .bin files located in the directory.

scan.get_all_html_files(directory)

Returns a list of the HTML files located in the directory.