parse module

parse.extract_html_bs4(html_file_path: str, remove_stopwords: bool = True, enable_lemmetization: bool = False)

Extracts all text from the HTML file at the given path and returns it as a list of words (using library: BeautifulSoup4).

Parameters

html_file_path (str) – Path to the HTML file; the file is opened with open()
remove_stopwords (bool, optional) – If True, removes stopwords such as [“the”, “them”, etc.]; defaults to True
enable_lemmetization (bool, optional) – If True, lemmatizes words (for example, cats -> cat); defaults to False

Returns

A list of words, all in lowercase

Return type

List[str]
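The behaviour described above (open a file, pull out its text, return lowercase words) can be sketched with the standard library alone. This is not the library's implementation: it swaps BeautifulSoup4 for Python's built-in html.parser so the example has no dependencies, and the names `_TextCollector` and `extract_words_sketch`, the choice to skip script and style contents, and the tokenizing regex are all assumptions made for illustration.

```python
import os
import re
import tempfile
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_words_sketch(html_file_path: str) -> List[str]:
    """Stand-in for the documented behaviour, without stopword handling."""
    with open(html_file_path, encoding="utf-8") as fh:
        parser = _TextCollector()
        parser.feed(fh.read())
        parser.close()
    text = " ".join(parser.chunks).lower()
    # Assumed tokenization: runs of ASCII letters and digits.
    return re.findall(r"[a-z0-9]+", text)


# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<html><body><h1>The Cats</h1>"
              "<script>var x = 1;</script><p>sat on them</p></body></html>")
demo_words = extract_words_sketch(tmp.name)
os.remove(tmp.name)
print(demo_words)
```

The real function may tokenize differently or keep text that this sketch drops; treat the output here only as an illustration of the documented return shape, a list of lowercase strings.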

parse.extract_html_newspaper(html_file: str, remove_stopwords=True, enable_lemmetization=False) → List[str]

Extracts all text from the HTML file at the given path and returns it as a list of words (using library: Newspaper3k).

Parameters

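html_file (str) – Path to the HTML file; the file is opened with open()
remove_stopwords (bool, optional) – If True, removes stopwords such as [“the”, “them”, etc.]; defaults to True
enable_lemmetization (bool, optional) – If True, lemmatizes words (for example, cats -> cat); defaults to False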

Returns

A list of words, all in lowercase

Return type

List[str]
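Both functions accept the remove_stopwords and enable_lemmetization flags. A toy sketch of what such flags do to a word list is shown here; it is not the library's code. The stopword set, the plural-stripping rule, the order of the two steps, and the names `STOPWORDS`, `naive_lemma` and `postprocess_sketch` are all assumptions chosen for illustration, and a real lemmatizer handles far more than a trailing "s".

```python
from typing import Iterable, List

# Tiny illustrative stopword set (assumption); real lists are much larger.
STOPWORDS = frozenset({"the", "them", "a", "an", "on", "of", "and"})


def naive_lemma(word: str) -> str:
    """Toy rule: strip one trailing plural 's' (cats -> cat)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def postprocess_sketch(
    words: Iterable[str],
    remove_stopwords: bool = True,
    enable_lemmetization: bool = False,
) -> List[str]:
    """Lowercase the words, then apply the two optional steps."""
    result = [w.lower() for w in words]
    if remove_stopwords:
        result = [w for w in result if w not in STOPWORDS]
    if enable_lemmetization:
        result = [naive_lemma(w) for w in result]
    return result


raw = ["The", "Cats", "sat", "on", "them"]
print(postprocess_sketch(raw))
print(postprocess_sketch(raw, enable_lemmetization=True))
print(postprocess_sketch(raw, remove_stopwords=False))
```

With the defaults, only stopword removal runs; lemmatization has to be switched on explicitly, mirroring the default values in the signatures above.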
