HTMLSectionSplitter#
- class langchain_text_splitters.html.HTMLSectionSplitter(headers_to_split_on: List[Tuple[str, str]], xslt_path: str | None = None, **kwargs: Any)[source]#
Splits HTML files based on specified tags and font sizes. Requires the lxml package.
Create a new HTMLSectionSplitter.
- Parameters:
headers_to_split_on (List[Tuple[str, str]]) – list of tuples pairing each header tag to track with an (arbitrary) metadata key. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].
xslt_path (Optional[str]) – path to the XSLT file used for document transformation. Uses a default if not passed. Needed for HTML content that uses different formats and layouts.
kwargs (Any)
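The mapping described by headers_to_split_on can be sketched in plain Python. This is a minimal illustration that uses only the standard library and is not the library's own code: the helper names build_header_map and section_metadata are hypothetical, and the validation against the allowed tags is an assumption about how such a check might look.

```python
from typing import Dict, List, Tuple

# Allowed header values as listed in the parameter description.
ALLOWED_HEADERS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def build_header_map(headers_to_split_on: List[Tuple[str, str]]) -> Dict[str, str]:
    """Turn [(tag, metadata_key), ...] into a tag -> metadata_key lookup.

    Hypothetical helper: rejecting unknown tags is an illustrative
    assumption, not documented behavior of HTMLSectionSplitter.
    """
    header_map: Dict[str, str] = {}
    for tag, key in headers_to_split_on:
        if tag not in ALLOWED_HEADERS:
            raise ValueError(f"Unsupported header tag: {tag!r}")
        header_map[tag] = key
    return header_map


def section_metadata(header_map: Dict[str, str], tag: str, header_text: str) -> Dict[str, str]:
    """Build the metadata for one section: the arbitrary key maps to the header text."""
    return {header_map[tag]: header_text}


header_map = build_header_map([("h1", "Header 1"), ("h2", "Header 2")])
print(section_metadata(header_map, "h2", "Installation"))
```

The keys ("Header 1", "Header 2") are arbitrary labels; only the tag names on the left are constrained.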
Methods
__init__(headers_to_split_on[, xslt_path]) – Create a new HTMLSectionSplitter.
convert_possible_tags_to_header(html_content)
create_documents(texts[, metadatas]) – Create documents from a list of texts.
split_documents(documents) – Split documents.
split_html_by_headers(html_doc)
split_text(text) – Split HTML text string
split_text_from_file(file) – Split HTML file
- __init__(headers_to_split_on: List[Tuple[str, str]], xslt_path: str | None = None, **kwargs: Any) None [source]#
Create a new HTMLSectionSplitter.
- Parameters:
headers_to_split_on (List[Tuple[str, str]]) – list of tuples pairing each header tag to track with an (arbitrary) metadata key. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].
xslt_path (str | None) – path to the XSLT file used for document transformation. Uses a default if not passed. Needed for HTML content that uses different formats and layouts.
kwargs (Any)
- Return type:
None
- convert_possible_tags_to_header(html_content: str) str [source]#
- Parameters:
html_content (str)
- Return type:
str
- create_documents(texts: List[str], metadatas: List[dict] | None = None) List[Document] [source]#
Create documents from a list of texts.
- Parameters:
texts (List[str])
metadatas (List[dict] | None)
- Return type:
List[Document]
- split_html_by_headers(html_doc: str) List[Dict[str, str | None]] [source]#
- Parameters:
html_doc (str)
- Return type:
List[Dict[str, str | None]]
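The return type above suggests one dictionary per section. The following standard-library sketch shows what header-based splitting could look like. It is an illustration under assumptions, not the library's implementation: the class HeaderSectionParser, the function split_html_by_headers_sketch, and the dictionary keys "header", "content", and "tag_name" are all hypothetical choices made for this example, and text appearing before the first tracked header is simply dropped.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class HeaderSectionParser(HTMLParser):
    """Group text under each tracked header tag (illustrative only)."""

    def __init__(self, header_tags: List[str]) -> None:
        super().__init__()
        self.header_tags = set(header_tags)
        self.sections: List[Dict[str, Optional[str]]] = []
        self._current_tag: Optional[str] = None  # tag of the open section
        self._in_header = False                  # True while inside <hN>...</hN>
        self._header_parts: List[str] = []
        self._content_parts: List[str] = []

    def _flush(self) -> None:
        # Emit the section collected so far, then reset the buffers.
        if self._current_tag is not None:
            self.sections.append(
                {
                    "header": " ".join(self._header_parts),
                    "content": " ".join(self._content_parts),
                    "tag_name": self._current_tag,
                }
            )
        self._header_parts = []
        self._content_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self._flush()  # a new tracked header closes the previous section
            self._current_tag = tag
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in self.header_tags:
            self._in_header = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_header:
            self._header_parts.append(text)
        elif self._current_tag is not None:
            self._content_parts.append(text)

    def close(self):
        super().close()
        self._flush()  # emit the final section
        self._current_tag = None


def split_html_by_headers_sketch(
    html_doc: str, header_tags: List[str]
) -> List[Dict[str, Optional[str]]]:
    parser = HeaderSectionParser(header_tags)
    parser.feed(html_doc)
    parser.close()
    return parser.sections


html_doc = (
    "<html><body><h1>Intro</h1><p>Hello world.</p>"
    "<h2>Setup</h2><p>Install it.</p><p>Run it.</p></body></html>"
)
for section in split_html_by_headers_sketch(html_doc, ["h1", "h2"]):
    print(section)
```

Each dictionary pairs a header's text with the text that follows it up to the next tracked header, which is the information a caller would need to build one document per section.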