HTMLSectionSplitter

class langchain_text_splitters.html.HTMLSectionSplitter(headers_to_split_on: List[Tuple[str, str]], xslt_path: str | None = None, **kwargs: Any)

Splits HTML files into sections based on specified header tags and font sizes. Requires the lxml package.

Create a new HTMLSectionSplitter.

Parameters:
  • headers_to_split_on (List[Tuple[str, str]]) – list of tuples pairing each header tag to track with an (arbitrary) key used in the metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].

  • xslt_path (Optional[str]) – path to the XSLT file used for document transformation. Uses a default if not passed. Needed for HTML content that uses different formats and layouts.

  • kwargs (Any)
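
As an illustration of the parameters above, here is a minimal usage sketch. It assumes lxml (and, in the versions this sketch was written against, beautifulsoup4) is installed; `SAMPLE_HTML` and `split_sample_html` are hypothetical names introduced for the example and are not part of the library.

```python
from typing import Any, List

# Hypothetical sample input, used only for this illustration.
SAMPLE_HTML = """
<html>
  <body>
    <h1>Introduction</h1>
    <p>Some introductory text.</p>
    <h2>Background</h2>
    <p>Some background text.</p>
  </body>
</html>
"""


def split_sample_html(html: str) -> List[Any]:
    """Split an HTML string into one Document per tracked header section."""
    # Imported inside the function so this module can still be loaded when
    # the optional dependencies (lxml; assumed: beautifulsoup4) are missing.
    from langchain_text_splitters import HTMLSectionSplitter

    # Each tuple pairs a header tag with an arbitrary metadata key.
    headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]

    # xslt_path is omitted, so the splitter falls back to its default XSLT.
    splitter = HTMLSectionSplitter(headers_to_split_on=headers_to_split_on)
    return splitter.split_text(html)
```

Calling `split_sample_html(SAMPLE_HTML)` should return a list of Document objects. Each one is expected to carry the section text in `page_content` and the matching header under the chosen key (for example "Header 1") in `metadata`, though the exact content may differ between library versions.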

Methods

__init__(headers_to_split_on[, xslt_path])

Create a new HTMLSectionSplitter.

convert_possible_tags_to_header(html_content)

create_documents(texts[, metadatas])

Create documents from a list of texts.

split_documents(documents)

Split documents.

split_html_by_headers(html_doc)

split_text(text)

Split HTML text string

split_text_from_file(file)

Split HTML file

__init__(headers_to_split_on: List[Tuple[str, str]], xslt_path: str | None = None, **kwargs: Any) → None

Create a new HTMLSectionSplitter.

Parameters:
  • headers_to_split_on (List[Tuple[str, str]]) – list of tuples pairing each header tag to track with an (arbitrary) key used in the metadata. Allowed header values: h1, h2, h3, h4, h5, h6, e.g. [("h1", "Header 1"), ("h2", "Header 2")].

  • xslt_path (str | None) – path to the XSLT file used for document transformation. Uses a default if not passed. Needed for HTML content that uses different formats and layouts.

  • kwargs (Any)

Return type:

None

convert_possible_tags_to_header(html_content: str) → str
Parameters:

html_content (str)

Return type:

str

create_documents(texts: List[str], metadatas: List[dict] | None = None) → List[Document]

Create documents from a list of texts.

Parameters:
  • texts (List[str])

  • metadatas (List[dict] | None)

Return type:

List[Document]

split_documents(documents: Iterable[Document]) → List[Document]

Split documents.

Parameters:

documents (Iterable[Document])

Return type:

List[Document]

split_html_by_headers(html_doc: str) → List[Dict[str, str | None]]
Parameters:

html_doc (str)

Return type:

List[Dict[str, str | None]]
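
To make the return shape more concrete, the following standard-library sketch shows one way header-based sectioning can work. It is not the library's implementation. The class `HeaderSectionParser`, the function `split_by_headers`, and the dictionary keys `header`, `tag_name`, and `content` are assumptions chosen for illustration.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

Section = Dict[str, Optional[str]]


class HeaderSectionParser(HTMLParser):
    """Hypothetical helper: group text under the most recent tracked header."""

    def __init__(self, header_tags: List[str]) -> None:
        super().__init__()
        self.header_tags = set(header_tags)
        self.sections: List[Section] = []
        self._in_header = False
        # Text seen before the first tracked header has no header of its own.
        self._current: Section = {"header": None, "tag_name": None, "content": ""}

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            # A new tracked header closes the section collected so far.
            self._flush()
            self._current = {"header": "", "tag_name": tag, "content": ""}
            self._in_header = True

    def handle_endtag(self, tag):
        if tag in self.header_tags:
            self._in_header = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Text inside the header tag is the title; everything else is content.
        key = "header" if self._in_header else "content"
        existing = self._current[key] or ""
        self._current[key] = (existing + " " + text).strip()

    def _flush(self) -> None:
        # Sections without any body text are dropped in this sketch.
        if self._current["content"]:
            self.sections.append(self._current)

    def close(self) -> None:
        super().close()
        self._flush()


def split_by_headers(html: str, header_tags: List[str]) -> List[Section]:
    """Return one dict per section, keyed by header, tag_name, and content."""
    parser = HeaderSectionParser(header_tags)
    parser.feed(html)
    parser.close()
    return parser.sections
```

For example, `split_by_headers(html, ["h1", "h2"])` yields one dictionary per header that has body text, which mirrors the `List[Dict[str, str | None]]` return type documented above.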

split_text(text: str) → List[Document]

Split HTML text string

Parameters:

text (str) – HTML text

Return type:

List[Document]

split_text_from_file(file: Any) → List[Document]

Split HTML file

Parameters:

file (Any) – HTML file

Return type:

List[Document]

Examples using HTMLSectionSplitter