parse module

parse.extract_html_bs4(html_file_path: str, remove_stopwords: bool = True, enable_lemmetization: bool = False)

Extracts all text from the HTML file at the given path and returns it as a list of words (using library: BeautifulSoup4).

Parameters
  • html_file_path (str) – Path to the HTML file; the file is opened with open()

  • remove_stopwords (bool, optional) – Removes stopwords such as “the” and “them” if set to True, defaults to True

  • enable_lemmetization (bool, optional) – Lemmatizes words if set to True (e.g. cats → cat), defaults to False

Returns

A list of words, all in lowercase

Return type

List[str]
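The behaviour described above (read the file, extract its text, lowercase it, split it into words, optionally drop stopwords) can be illustrated with a minimal sketch. This is not the module's implementation: it uses the standard library's html.parser in place of BeautifulSoup4 so that it runs without extra dependencies, it omits lemmatization, and the names extract_words_sketch, _TextCollector and STOPWORDS, the tokenizing regular expression, and the choice to skip script and style contents are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from typing import List

# Tiny illustrative stopword list (assumption; the module's real list is not shown).
STOPWORDS = {"the", "them", "a", "an", "and", "is", "on"}


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_words_sketch(html_file_path: str, remove_stopwords: bool = True) -> List[str]:
    """Hypothetical stand-in for the documented behaviour, without lemmatization."""
    # The path is opened with open(), as the parameter description states.
    with open(html_file_path, encoding="utf-8") as fh:
        collector = _TextCollector()
        collector.feed(fh.read())
        collector.close()

    # Lowercase everything, then split into simple alphanumeric tokens.
    text = " ".join(collector.chunks).lower()
    words = re.findall(r"[a-z0-9']+", text)

    if remove_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return words
```

Calling extract_words_sketch("page.html", remove_stopwords=False) on a hypothetical file keeps every token, which makes it easy to compare against the default filtered output.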

parse.extract_html_newspaper(html_file: str, remove_stopwords=True, enable_lemmetization=False) → List[str]

Extracts all text from the HTML file at the given path and returns it as a list of words (using library: Newspaper3k).

Parameters
  • html_file (str) – Path to the HTML file; the file is opened with open()

  • remove_stopwords (bool, optional) – Removes stopwords such as “the” and “them” if set to True, defaults to True

  • enable_lemmetization (bool, optional) – Lemmatizes words if set to True (e.g. cats → cat), defaults to False

Returns

A list of words, all in lowercase

Return type

List[str]