parse module
parse.extract_html_bs4(html_file_path: str, remove_stopwords: bool = True, enable_lemmetization: bool = False)
Given a path to an HTML file, extracts all text in it and returns a list of words (using the BeautifulSoup4 library).
- Parameters
html_file_path (str) – Path to the HTML file; it is opened with open()
remove_stopwords (bool, optional) – If True, removes stopwords such as “the” and “them”; defaults to True
enable_lemmetization (bool, optional) – If True, lemmatizes words (for example, cats -> cat); defaults to False
- Returns
A list of words all in lowercase
- Return type
List[str]
parse.extract_html_newspaper(html_file: str, remove_stopwords=True, enable_lemmetization=False) → List[str]
Given a path to an HTML file, extracts all text in it and returns a list of words (using the Newspaper3k library).
- Parameters
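html_file (str) – Path to the HTML file; it is opened with open()
remove_stopwords (bool, optional) – If True, removes stopwords such as “the” and “them”; defaults to True
enable_lemmetization (bool, optional) – If True, lemmatizes words (for example, cats -> cat); defaults to False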
- Returns
A list of words all in lowercase
- Return type
List[str]
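
The behaviour documented for both functions (read an HTML file, extract its text, lowercase it, split it into words, optionally drop stopwords) can be illustrated with a small sketch. This is not the module's implementation: it uses the standard-library html.parser instead of BeautifulSoup4 or Newspaper3k so that it runs without extra dependencies, the name extract_words_sketch and the STOPWORDS set are hypothetical, and lemmatization is left out.

```python
import re
from html.parser import HTMLParser
from typing import List

# Hypothetical, deliberately tiny stopword set used only for this sketch.
STOPWORDS = {"the", "them", "a", "an", "and", "of", "on", "in", "is", "are"}


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_words_sketch(html_file_path: str, remove_stopwords: bool = True) -> List[str]:
    """Return the lowercase words found in the text of an HTML file."""
    with open(html_file_path, encoding="utf-8") as handle:
        collector = _TextCollector()
        collector.feed(handle.read())
        collector.close()

    # Lowercase everything, then keep runs of ASCII letters and digits as words.
    text = " ".join(collector.chunks).lower()
    words = re.findall(r"[a-z0-9]+", text)

    if remove_stopwords:
        words = [word for word in words if word not in STOPWORDS]
    return words
```

Calling extract_words_sketch("page.html") on a file whose body is "The Cats sat on the mat." would return the lowercase words with "the" and "on" removed; passing remove_stopwords=False keeps them.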