HTMLSectionSplitter

class langchain_text_splitters.html.HTMLSectionSplitter(headers_to_split_on: List[Tuple[str, str]], xslt_path: str | None = None, **kwargs: Any)

Splits HTML files into sections based on specified tags and font sizes. Requires the lxml package.

Create a new HTMLSectionSplitter.

Parameters:

headers_to_split_on (List[Tuple[str, str]]) – list of tuples pairing each header tag to track with an (arbitrary) key used for metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].
xslt_path (Optional[str]) – path to the XSLT file used for document transformation. Uses a default if not passed. Needed for HTML content that uses different formats and layouts.
kwargs (Any)
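
As an illustration of the parameters above, the following minimal sketch builds a splitter and applies it to an HTML string. The names `HEADERS_TO_SPLIT_ON`, `SAMPLE_HTML`, and `split_html_sections` are hypothetical helpers introduced for this example; the import is done lazily on the assumption that the optional dependencies may not be installed.

```python
from typing import List, Tuple

# (header tag, metadata key) pairs; the key names are arbitrary labels.
HEADERS_TO_SPLIT_ON: List[Tuple[str, str]] = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]

# Hypothetical sample input for the example.
SAMPLE_HTML = """<html><body>
<h1>Introduction</h1>
<p>Opening paragraph.</p>
<h2>Background</h2>
<p>Some background text.</p>
</body></html>"""


def split_html_sections(html: str = SAMPLE_HTML):
    """Split an HTML string into Documents using the tracked headers."""
    # Imported lazily so this module still loads when the optional
    # dependencies (langchain-text-splitters, lxml) are missing.
    from langchain_text_splitters import HTMLSectionSplitter

    splitter = HTMLSectionSplitter(headers_to_split_on=HEADERS_TO_SPLIT_ON)
    return splitter.split_text(html)
```

Calling `split_html_sections()` should return a list of Document objects; each is expected to hold a section's text in `page_content` and the matching header text in `metadata` under the configured key (for example `"Header 2"`).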

Methods

`__init__`(headers_to_split_on[, xslt_path])	Create a new HTMLSectionSplitter.
`convert_possible_tags_to_header`(html_content)
`create_documents`(texts[, metadatas])	Create documents from a list of texts.
`split_documents`(documents)	Split documents.
`split_html_by_headers`(html_doc)
`split_text`(text)	Split HTML text string
`split_text_from_file`(file)	Split HTML file

__init__(headers_to_split_on: List[Tuple[str, str]], xslt_path: str | None = None, **kwargs: Any) → None

Create a new HTMLSectionSplitter.

Parameters:

headers_to_split_on (List[Tuple[str, str]]) – list of tuples pairing each header tag to track with an (arbitrary) key used for metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].
xslt_path (str | None) – path to the XSLT file used for document transformation. Uses a default if not passed. Needed for HTML content that uses different formats and layouts.
kwargs (Any)

Return type:

None

convert_possible_tags_to_header(html_content: str) → str

Parameters:

html_content (str)

Return type:

str

create_documents(texts: List[str], metadatas: List[dict] | None = None) → List[Document]

Create documents from a list of texts.

Parameters:

texts (List[str])
metadatas (List[dict] | None)

Return type:

List[Document]
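
A hedged sketch of calling `create_documents` with one metadata dictionary per input text. `PAGES`, `PAGE_METADATA`, and `build_documents` are hypothetical names. The `"Title"` entry is included on the assumption that the splitter may substitute it for a placeholder title on the leading section; check this against the installed version.

```python
from typing import Dict, List

# Hypothetical inputs: one metadata dict per HTML text, in the same order.
PAGES: List[str] = [
    "<html><body><h1>Guide</h1><p>Step one.</p></body></html>",
]
PAGE_METADATA: List[Dict[str, str]] = [
    {"Title": "Guide", "source": "guide.html"},
]


def build_documents(pages: List[str], page_metadata: List[Dict[str, str]]):
    """Create Documents from HTML texts, attaching per-text metadata."""
    # Lazy import: requires langchain-text-splitters and lxml at call time.
    from langchain_text_splitters import HTMLSectionSplitter

    splitter = HTMLSectionSplitter(
        headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")]
    )
    return splitter.create_documents(pages, metadatas=page_metadata)
```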

split_documents(documents: Iterable[Document]) → List[Document]

Split documents.

Parameters:

documents (Iterable[Document])

Return type:

List[Document]

split_html_by_headers(html_doc: str) → List[Dict[str, str | None]]

Parameters:

html_doc (str)

Return type:

List[Dict[str, str | None]]
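
To make the idea of splitting by headers concrete, here is a standard-library sketch that groups text under each tracked header tag and returns a list of dictionaries. It is not the library's implementation: `HeaderSectionParser`, `split_by_headers`, and the dictionary keys (`header`, `content`, `tag_name`) are illustrative choices for this example, and it uses `html.parser` instead of lxml.

```python
from html.parser import HTMLParser
from typing import Dict, List


class HeaderSectionParser(HTMLParser):
    """Collect the text that follows each tracked header tag."""

    def __init__(self, header_tags: List[str]) -> None:
        super().__init__()
        self.header_tags = set(header_tags)
        self.sections: List[Dict[str, str]] = []
        self._in_header = False

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            # A tracked header starts a new section.
            self.sections.append({"header": "", "content": "", "tag_name": tag})
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in self.header_tags:
            self._in_header = False

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first tracked header is skipped here
        key = "header" if self._in_header else "content"
        self.sections[-1][key] += data


def split_by_headers(html: str, header_tags: List[str]) -> List[Dict[str, str]]:
    """Return one dict per section, with surrounding whitespace stripped."""
    parser = HeaderSectionParser(header_tags)
    parser.feed(html)
    parser.close()
    return [
        {
            "header": s["header"].strip(),
            "content": s["content"].strip(),
            "tag_name": s["tag_name"],
        }
        for s in parser.sections
    ]
```

For example, `split_by_headers("<h1>Intro</h1><p>Hello</p>", ["h1"])` yields a single section whose header is `"Intro"` and whose content is `"Hello"`.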

split_text(text: str) → List[Document]

Split HTML text string.

Parameters:

text (str) – HTML text

Return type:

List[Document]

split_text_from_file(file: Any) → List[Document]

Split HTML file.

Parameters:

file (Any) – HTML file

Return type:

List[Document]
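
Because `file` is typed as `Any`, the exact file-like interface expected is not stated here. One cautious approach, sketched below, is to load the HTML into an in-memory `io.StringIO` buffer and pass that. The helpers `to_buffer` and `split_html_file` are hypothetical, and the choice of `StringIO` is an assumption rather than a documented requirement.

```python
import io
from pathlib import Path


def to_buffer(html: str) -> io.StringIO:
    """Wrap HTML text in an in-memory, file-like text buffer."""
    return io.StringIO(html)


def split_html_file(path: str):
    """Read an HTML file from disk and split it into Documents."""
    # Lazy import: requires langchain-text-splitters and lxml at call time.
    from langchain_text_splitters import HTMLSectionSplitter

    splitter = HTMLSectionSplitter(
        headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")]
    )
    buffer = to_buffer(Path(path).read_text(encoding="utf-8"))
    return splitter.split_text_from_file(buffer)
```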

Examples using HTMLSectionSplitter

How to split by HTML sections