Golang HTML to Text: Strip Tags, Parse HTML, and Convert HTTP Body

Convert HTML to text in Go, strip HTML tags, read HTTP response body with io.ReadAll, parse HTML with golang.org/x/net/html, extract body text or body HTML, and know when html2text or bluemonday fits better.

Reviewed by Deepak Prasad

This page ties together tasks people mix up: turning an HTTP response body into a Go string, stripping tags for rough plain text, parsing HTML to find the <body>, and optionally serializing body markup again. Those are different steps in the same pipeline. For regular-expression basics used in tiny strippers, see regular expressions in Go.

Tested on: Go 1.22, 64-bit Linux. Snippets that import golang.org/x/net/html, third-party modules, or perform real HTTP requests are marked {run=false} and are meant to run inside a small module on your machine (go mod init, then go get as needed).


Quick answer: body bytes, raw HTML string, then meaning

To put an HTTP response body in a string, read resp.Body with io.ReadAll, handle errors, and convert with string(body). That string still contains markup if the server returned HTML.

To produce plain text from HTML, parse with golang.org/x/net/html (tokenizer or html.Parse) or use a dedicated html2text library. Regex can strip simple tags on small, trusted fragments but is not a general HTML solution.


Convert HTML to text in Go

Start with what you actually need:

| Goal | What it means |
| --- | --- |
| Response body → string | Bytes from the wire as a Go string (often still HTML) |
| Strip HTML tags | Remove <…> segments to approximate plain text |
| HTML → readable text | Handle entities, spacing, lists, and links; usually needs a parser or library |
| Body text | Plain text taken from nodes inside <body> |
| Body HTML | Markup inside or including <body>, serialized back to a string |
| Sanitize HTML | Allow safe tags and attributes for browser output; different from "text only" |

Example: for <p>Hello <b>Go</b></p>, plain text might be Hello Go (with spaces handled by your extractor). Body HTML might be the inner <p>…</p> or the full <body>…</body> depending on what you serialize.

When you only need plain text

Prefer a tokenizer walk, a tree walk after html.Parse, or an html2text package when layout and entities matter.

When you should parse HTML instead of using regex

HTML is not a regular language. Malformed tags, attributes containing >, comments, CDATA, and <script> / <style> blocks break naive patterns. Use x/net/html or a library for real pages.


Read HTTP response body as string

This matches searches such as golang http body to string and golang response body to string: you only move bytes into memory and cast to string. Tags remain until you strip or parse. For building HTTP clients and handlers, see Golang HTTP.

Call io.ReadAll(resp.Body) after a successful Get or Do, check the error, then convert with string(body). Always defer resp.Body.Close(): the transport can only reuse a keep-alive connection once the body has been read to completion and closed. io.ReadAll loads the entire body into memory; for large downloads, stream with io.Copy to a file or cap the read with io.LimitReader.

go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get("https://example.com/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	html := string(body)
	prefix := strings.TrimSpace(html)
	if len(prefix) > 60 {
		prefix = prefix[:60] + "…"
	}
	fmt.Println("bytes:", len(body), "prefix:", prefix)
}

Run locally with network access; you should see a byte count and a short ASCII prefix of the document. The string still includes <!DOCTYPE, tags, and entities—reading the body is not HTML-to-text by itself.


Strip HTML tags from a string

Simple tag removal for trusted small input

A pattern such as <[^>]*>, applied with the ReplaceAllString method of a compiled *regexp.Regexp, can strip angle-bracket chunks from short, trusted snippets. Treat the result as best-effort.

go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	re := regexp.MustCompile(`<[^>]*>`)
	html := `<div><h1>GoLinuxCloud</h1><p>This is an html document!</p></div>`
	plain := re.ReplaceAllString(html, "")
	fmt.Println(plain)
}
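Output

GoLinuxCloudThis is an html document!

The run removes the tags, but adjacent text is glued together: the heading and paragraph above merge into one run, and <p>Hello</p><p>World</p> becomes HelloWorld. Insert spaces in place of tags or use a parser-aware extractor to avoid this.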
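The "insert spaces" workaround can be sketched with the standard library alone. This is an illustration under the same trusted-fragment assumption, not a general converter: roughText is a hypothetical helper that replaces each tag with a space, decodes entities with html.UnescapeString from the standard html package, and collapses whitespace.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

var tagRe = regexp.MustCompile(`<[^>]*>`)

// roughText replaces every tag with a space, decodes HTML entities,
// and collapses runs of whitespace into single spaces.
func roughText(fragment string) string {
	spaced := tagRe.ReplaceAllString(fragment, " ")
	decoded := html.UnescapeString(spaced)
	return strings.Join(strings.Fields(decoded), " ")
}

func main() {
	fmt.Println(roughText(`<p>Hello</p><p>Tom &amp; Jerry</p>`)) // Hello Tom & Jerry
	fmt.Println(roughText(`Hello <b>Go</b>!`))                   // Hello Go !
}
```

The second call shows the trade-off: a space is inserted at every tag, including inline ones, so punctuation after an inline element picks up a stray space.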

Why regex is not reliable for real HTML

Nested tags, > inside attributes, SGML-style oddities, and executable regions inside <script> make regex stripping unsafe as a general strategy. Escalate to x/net/html or html2text when the input is arbitrary.
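Two of these failure modes are easy to reproduce. The sketch below reuses the same <[^>]*> pattern; stripTags is a hypothetical helper name, and the inputs are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var tagRe = regexp.MustCompile(`<[^>]*>`)

// stripTags removes every <...> chunk, exactly like the earlier example.
func stripTags(s string) string {
	return tagRe.ReplaceAllString(s, "")
}

func main() {
	// A ">" inside an attribute value ends the match too early,
	// so part of the attribute leaks into the "text".
	fmt.Printf("%q\n", stripTags(`<a title="a > b">link</a>`)) // " b\">link"

	// The tags around a script are removed, but the code itself survives.
	fmt.Printf("%q\n", stripTags(`<script>alert(1)</script><p>Hi</p>`)) // "alert(1)Hi"
}
```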


Parse HTML using golang.org/x/net/html

The golang.org/x/net/html package provides an HTML5-compliant tokenizer and html.Parse, which builds an html.Node tree.

| Approach | Best for |
| --- | --- |
| html.NewTokenizer | Streaming scan, collecting TextToken payloads |
| html.Parse | Finding <body>, <title>, walking children, rendering subtrees |

Add the module to your project: go get golang.org/x/net/html.

Use tokenizer for text extraction

go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func textTokens(htmlInput string) []string {
	tkn := html.NewTokenizer(strings.NewReader(htmlInput))
	var out []string
	for {
		switch tkn.Next() {
		case html.ErrorToken:
			return out
		case html.TextToken:
			t := tkn.Token()
			if s := strings.TrimSpace(t.Data); s != "" {
				out = append(out, s)
			}
		}
	}
}

func main() {
	doc := `<!DOCTYPE html><html><head><title>Pets</title></head><body>
<p>A list of pets</p><ul><li>dog</li><li>cat</li></ul>
<footer>GoLinuxCloud</footer></body></html>`
	fmt.Println(textTokens(doc))
}

You should see a slice of non-empty fragments (including <title> text unless you filter by context). For production text, skip script, style, and head tokens or switch to a tree walk.

Use node parser when you need the body element

html.Parse returns the document root; descend to locate the <body> element and recurse only inside it for text or rendering.


Extract text from the HTML body tag

Typical flow: parse → find <body> → walk child nodes → collect html.TextNode data → normalize spaces.

Skip or strip script, style, and usually head content when building reader-facing text. Between block elements (p, div, br, li, headings), insert spaces or newlines so <p>Hello</p><p>World</p> does not become HelloWorld.

go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func findBody(n *html.Node) *html.Node {
	if n.Type == html.ElementNode && n.Data == "body" {
		return n
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		if b := findBody(c); b != nil {
			return b
		}
	}
	return nil
}

func textFromBody(body *html.Node) string {
	var b strings.Builder
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		switch n.Type {
		case html.TextNode:
			b.WriteString(n.Data)
		case html.ElementNode:
			switch n.Data {
			case "script", "style":
				return
			case "p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6":
				if b.Len() > 0 {
					b.WriteByte(' ')
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(body)
	return strings.Join(strings.Fields(b.String()), " ")
}

func main() {
	doc := `<!DOCTYPE html><html><head><title>X</title></head><body><p>Hello</p><p>World</p></body></html>`
	root, err := html.Parse(strings.NewReader(doc))
	if err != nil {
		panic(err)
	}
	body := findBody(root)
	if body == nil {
		panic("no body")
	}
	fmt.Println(textFromBody(body))
}

After go get golang.org/x/net/html, running this prints Hello World, with a separating space instead of HelloWorld.


Extract the HTML body as a string

This is not the same as plain text: you keep tags and serialize a subtree.

| Requirement | Typical output |
| --- | --- |
| Text inside <body> | Plain string |
| Inner body HTML | Serialized children of <body> |
| Full <body>…</body> | Include the body element in rendering |

html.Render writes a *html.Node subtree to an io.Writer. Render the <body> node itself to include the outer tags, or render each child of body in turn (start at FirstChild and follow NextSibling) if you want the inner HTML only.

go
package main

import (
	"bytes"
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func findBody(n *html.Node) *html.Node {
	if n.Type == html.ElementNode && n.Data == "body" {
		return n
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		if b := findBody(c); b != nil {
			return b
		}
	}
	return nil
}

func main() {
	doc := `<!DOCTYPE html><html><head></head><body><p id="x">Hi</p></body></html>`
	root, err := html.Parse(strings.NewReader(doc))
	if err != nil {
		panic(err)
	}
	body := findBody(root)
	if body == nil {
		panic("no body") // html.Render would dereference a nil node
	}
	var buf bytes.Buffer
	if err := html.Render(&buf, body); err != nil {
		panic(err)
	}
	fmt.Println(buf.String())
}

You should get a string starting with <body> that still contains the <p> markup—suitable for storage or further processing, not for “human plain text.”


Use an html2text package

When you want emails, CMS HTML, or scraped pages turned into readable plain text with entities, links, and paragraphs handled consistently, a small third-party converter often beats hand-rolled walks.

github.com/k3a/html2text is one option: go get github.com/k3a/html2text@latest, then call HTML2Text on your HTML string inside a module.

go
package main

import (
	"fmt"

	"github.com/k3a/html2text"
)

func main() {
	doc := `<!DOCTYPE html><html><head><title>Pets</title></head><body>
<p>A list of pets</p><ul><li>dog</li><li>cat</li></ul>
<footer>GoLinuxCloud</footer></body></html>`
	fmt.Print(html2text.HTML2Text(doc))
}

Output layout depends on the library version; expect formatted plain text rather than a single joined line.


Sanitize HTML vs strip HTML tags

| Task | Purpose |
| --- | --- |
| Strip / HTML → text | Produce plain text for logs, search, ML, etc. |
| Sanitize | Keep safe HTML for rendering in a browser |

For untrusted HTML that will be injected into a page, use a sanitizer such as bluemonday, not regex tag removal. Stripping tags to get plain text does not make any markup that survives safe to assign to innerHTML.


Common mistakes

Using regex for complex HTML

Prefer x/net/html or html2text for arbitrary documents.

Forgetting to close response.Body

Use defer resp.Body.Close() after checking the error from Get / Do.

Reading very large response bodies into memory

io.ReadAll buffers everything; stream or cap size for big files.

Thinking string(body) means plain text

string(body) only copies the bytes into a Go string. It performs no charset conversion or UTF-8 validation, and the HTML tags remain.

Losing spaces between elements

Naive stripping or text concatenation can join words; add separators around block elements or use a library.

Confusing sanitization with text extraction

Sanitization is for safe HTML output; text extraction targets plain output.


Go HTML to text cheat sheet

| Goal | Approach |
| --- | --- |
| Response body → string | defer resp.Body.Close(); io.ReadAll(resp.Body) then string(b) |
| Tiny trusted fragment, rough strip | Regex replace (limited) |
| Real HTML → text | x/net/html tokenizer or tree walk, or html2text |
| Text only inside <body> | html.Parse, find body, walk text nodes (skip script/style) |
| Serialize <body> subtree to HTML | html.Render on the body node |
| Readable email / article text | html2text-style package |
| Safe HTML for browser | bluemonday (policy-based) |
| Large response | Stream; avoid unbounded ReadAll |

Which approach should you use?

| You are trying to… | Start here |
| --- | --- |
| Log or store raw page HTML | HTTP ReadAll + string |
| Quick lab strip on known HTML | Regex (then upgrade if inputs grow) |
| Reliable text or body extraction | golang.org/x/net/html |
| Nice plain text with layout | html2text module |
| User HTML shown on a site | bluemonday, not regex |

Summary

Reading an HTTP body gives a raw string that may still be full HTML. Stripping tags with regex is a narrow tool; parsing with golang.org/x/net/html lets you target the body, collect text vs render HTML, and skip script/style. Third-party html2text helps when readability matters; bluemonday addresses XSS when HTML is rendered, which is unrelated to plain-text extraction. Pick the shallowest step that matches input trust and output shape, then add streaming or sanitization when requirements grow.


Tuan Nguyen

Data Scientist