Meridian

HTML Content

container.object.html

HTML (HyperText Markup Language) content stored as VARCHAR. Detected by the presence of HTML tags (<p>, <div>, <a href=, <br>, <img>, etc.). Unlike XML, HTML5 allows optional closing tags, unquoted attribute values, and void elements. Common in CMS exports, email templates, web-scraped data, and rich text fields.

Domain: container
Category: object
Casts to: VARCHAR
Scope: Universal

Try it

CLI
$ finetype infer -i "<p>Hello world</p>"
→ container.object.html

DuckDB

Detect
SELECT finetype('<p>Hello world</p>');
-- → 'container.object.html'

Cast expression
REGEXP_REPLACE({col}, '<[^>]+>', '', 'g')

Safe cast pipeline
-- Normalise and cast in one step
SELECT TRY_CAST(finetype_cast(my_column) AS VARCHAR) AS clean_value
FROM my_table
WHERE finetype(my_column) = 'container.object.html';
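
To see what the cast expression does to a value, the sketch below applies the same regular expression with Python's standard `re` module. It is an illustration only: the function name `strip_tags` is hypothetical, and it assumes that Python's regex engine and DuckDB's behave identically for this simple pattern.

```python
import re

# Same pattern as the cast expression: REGEXP_REPLACE({col}, '<[^>]+>', '', 'g').
TAG = re.compile(r"<[^>]+>")

def strip_tags(html: str) -> str:
    """Remove every <...> run; text between tags is kept unchanged."""
    return TAG.sub("", html)

print(strip_tags("<p>Hello world</p>"))                               # Hello world
print(strip_tags('<div class="test"><a href="url">link</a></div>'))   # link
print(repr(strip_tags('<br><img src="photo.jpg">')))                  # ''
print(strip_tags("<p>Fish &amp; chips</p>"))                          # Fish &amp; chips
```

The pattern only removes markup. Character entities such as `&amp;` are not decoded, and the text inside `<script>` or `<style>` elements stays in the output.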

Struct Expansion

Expression
tag_count: CAST(REGEXP_COUNT({col}, '<[a-zA-Z][^>]*>') AS INTEGER)
text_content: REGEXP_REPLACE({col}, '<[^>]+>', '', 'g')
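
The two struct fields can be approximated in Python as follows. This is a rough sketch, not FineType's implementation: `expand` is a hypothetical helper, and it assumes REGEXP_COUNT counts non-overlapping matches the way `re.findall` does.

```python
import re

# tag_count pattern: '<' followed by a letter, so closing tags like </p> are not counted.
OPENING_TAG = re.compile(r"<[a-zA-Z][^>]*>")
# text_content pattern: any <...> run, opening or closing.
ANY_TAG = re.compile(r"<[^>]+>")

def expand(html: str) -> dict:
    """Return the two struct fields for one HTML string."""
    return {
        "tag_count": len(OPENING_TAG.findall(html)),
        "text_content": ANY_TAG.sub("", html),
    }

print(expand("<p>Hello world</p>"))
# {'tag_count': 1, 'text_content': 'Hello world'}
print(expand("<ul><li>Item 1</li><li>Item 2</li></ul>"))
# {'tag_count': 3, 'text_content': 'Item 1Item 2'}
```

Because the count pattern requires a letter after `<`, tag_count reflects opening and void tags only. Text from adjacent elements is concatenated without a separator, as the second result shows.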

JSON Schema

$ finetype schema container.object.html
{
  "$id": "https://meridian.online/schemas/container.object.html",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "description": "HTML (HyperText Markup Language) content stored as VARCHAR. Detected by the presence of HTML tags (<p>, <div>, <a href=, <br>, <img>, etc.). Unlike XML, HTML5 allows unclosed tags, unquoted attributes, optional closing tags, and void elements. Common in CMS exports, email templates, web scraping data, and rich text fields.",
  "examples": [
    "<p>Hello world</p>",
    "<div class=\"test\"><a href=\"url\">link</a></div>",
    "<br><img src=\"photo.jpg\">",
    "<h1>Title</h1><p>Content here.</p>",
    "<ul><li>Item 1</li><li>Item 2</li></ul>",
    "<table><tr><td>Cell</td></tr></table>"
  ],
  "minLength": 3,
  "pattern": "^.*<(p|div|span|a|br|img|h[1-6]|ul|ol|li|table|tr|td|th|strong|em|b|i|form|input|button|select|textarea|header|footer|nav|section|article|main|aside|figure|figcaption|blockquote|pre|code|script|style|link|meta|head|body|html)[\\s>/ ].*$",
  "title": "HTML Content",
  "type": "string",
  "x-finetype-broad-type": "VARCHAR",
  "x-finetype-transform": "REGEXP_REPLACE({col}, '<[^>]+>', '', 'g')"
}
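
The schema's `pattern` and `minLength` constraints can be checked by hand with a few lines of Python. The sketch below copies the pattern from the schema with the JSON escaping removed; `matches_schema` is a hypothetical helper, and it covers only these two schema constraints, not whatever `finetype infer` does internally.

```python
import re

# Pattern copied from the schema's "pattern" field (JSON escaping removed).
HTML_PATTERN = re.compile(
    r"^.*<(p|div|span|a|br|img|h[1-6]|ul|ol|li|table|tr|td|th|strong|em|b|i|"
    r"form|input|button|select|textarea|header|footer|nav|section|article|"
    r"main|aside|figure|figcaption|blockquote|pre|code|script|style|link|"
    r"meta|head|body|html)[\s>/ ].*$"
)
MIN_LENGTH = 3  # from the schema's "minLength"

def matches_schema(value: str) -> bool:
    """True if the value satisfies the schema's minLength and pattern."""
    return len(value) >= MIN_LENGTH and HTML_PATTERN.search(value) is not None

print(matches_schema("<p>Hello world</p>"))          # True
print(matches_schema('<br><img src="photo.jpg">'))   # True
print(matches_schema("a < b"))                       # False
print(matches_schema("<video src=x>"))               # False
```

The last case shows that the pattern recognises only the tag names it lists, so a value containing nothing but other elements would not satisfy it.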

Examples

<p>Hello world</p>
<div class="test"><a href="url">link</a></div>
<br><img src="photo.jpg">
<h1>Title</h1><p>Content here.</p>
<ul><li>Item 1</li><li>Item 2</li></ul>
<table><tr><td>Cell</td></tr></table>

Aliases

html_content
html_fragment