Golang HTML to Text: Strip Tags, Parse HTML, and Read Response Body

This page ties together tasks people mix up: turning an HTTP response body into a Go string, stripping tags for rough plain text, parsing HTML to find the <body>, and optionally serializing body markup again. Those are different steps in the same pipeline. For regular-expression basics used in tiny strippers, see regular expressions in Go.

Tested on: Go 1.22, 64-bit Linux. Snippets that import golang.org/x/net/html or third-party modules, or that perform real HTTP requests, are meant to run inside a small module on your machine (go mod init, then go get as needed).

Quick answer: body bytes, raw HTML string, then meaning

To put an HTTP response body in a string, read resp.Body with io.ReadAll, handle errors, and convert with string(body). That string still contains markup if the server returned HTML.

To produce plain text from HTML, parse with golang.org/x/net/html (tokenizer or html.Parse) or use a dedicated html2text library. Regex can strip simple tags on small, trusted fragments but is not a general HTML solution.

Convert HTML to text in Go

Start with what you actually need:

Goal	What it means
Response body → `string`	Bytes from the wire as a Go string (often still HTML)
Strip HTML tags	Remove `<…>` segments to approximate plain text
HTML → readable text	Decode entities, spacing, lists, links—usually needs a parser or library
Body text	Plain text taken from nodes inside `<body>`
Body HTML	Markup inside or including `<body>` serialized back to a string
Sanitize HTML	Allow safe tags/attrs for browser output—different from “text only”

Example: for <p>Hello <b>Go</b></p>, plain text might be Hello Go (with spaces handled by your extractor). Body HTML might be the inner <p>…</p> or the full <body>…</body> depending on what you serialize.

When you only need plain text

Prefer a tokenizer walk, a tree walk after html.Parse, or an html2text package when layout and entities matter.

When you should parse HTML instead of using regex

HTML is not a regular language. Malformed tags, attributes containing >, comments, CDATA, and <script> / <style> blocks break naive patterns. Use x/net/html or a library for real pages.

Read HTTP response body as string

This is the "HTTP body to string" or "response body to string" task: you only move bytes into memory and convert them to a string. Tags remain until you strip or parse. For building HTTP clients and handlers, see Golang HTTP.

Use io.ReadAll(resp.Body) after a successful Get or Do, check errors, then convert with string(body). Defer resp.Body.Close() once the request error has been checked; reading the body to the end and closing it lets the client reuse the connection. io.ReadAll loads the entire body into memory; for large downloads, stream with io.Copy to a file or cap the read with io.LimitReader.

go


package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

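func main() {
	// A timeout keeps a slow server from hanging the program.
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get("https://example.com/")
	if err != nil {
		panic(err)
	}
	// Close the body once the request itself has succeeded.
	defer resp.Body.Close()
	// io.ReadAll buffers the whole body in memory.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	// The conversion copies the bytes; the markup is untouched.
	html := string(body)
	// Truncate by runes so a multi-byte character is never split.
	prefix := []rune(strings.TrimSpace(html))
	if len(prefix) > 60 {
		prefix = append(prefix[:60], '…')
	}
	fmt.Println("bytes:", len(body), "prefix:", string(prefix))
}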

Run locally with network access; you should see a byte count and a short prefix of the document. The string still includes <!DOCTYPE, tags, and entities. Reading the body is not HTML-to-text by itself.
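The io.LimitReader cap mentioned above can be sketched as follows. This is a minimal example under stated assumptions: readCapped and sample are hypothetical helper names, and a fixed in-memory reader stands in for resp.Body so the snippet runs without network access.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// readCapped is a hypothetical helper: it reads at most limit bytes from r
// and reports whether more data was available beyond the cap.
func readCapped(r io.Reader, limit int64) (string, bool, error) {
	// Ask for one extra byte so "exactly limit" and "over limit" differ.
	data, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return "", false, err
	}
	if int64(len(data)) > limit {
		// The cut is byte-based and may split a multi-byte character.
		return string(data[:limit]), true, nil
	}
	return string(data), false, nil
}

// sample feeds a fixed HTML snippet through readCapped; in a real client
// the reader would be resp.Body.
func sample(limit int64) (string, bool, error) {
	return readCapped(strings.NewReader("<p>Hello <b>Go</b></p>"), limit)
}

func main() {
	body, truncated, err := sample(8)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q truncated=%v\n", body, truncated)
}
```

With a cap of 8 the result is the first eight bytes (<p>Hello) and truncated is true. Reading one byte past the limit is a design choice that lets the caller tell a body that fits exactly from one that was cut off.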

Strip HTML tags from a string

Simple tag removal for trusted small input

A pattern such as <[^>]*> with regexp.ReplaceAllString can strip angle-bracket chunks on tight, trusted snippets. Treat the result as best-effort.

go


package main

import (
	"fmt"
	"regexp"
)

func main() {
	re := regexp.MustCompile(`<[^>]*>`)
	html := `<div><h1>GoLinuxCloud</h1><p>This is an html document!</p></div>`
	plain := re.ReplaceAllString(html, "")
	fmt.Println(plain)
}

Output

GoLinuxCloudThis is an html document!

The tags are gone, but adjacent text is glued together (the same way Hello</p><p>World becomes HelloWorld) unless you insert spaces or use a parser-aware extractor.
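One way to avoid glued words on small, trusted fragments is to replace each tag with a space, decode entities, and collapse whitespace. The sketch below assumes a hypothetical helper named stripTagsSpaced; it uses only the standard library (the html package here is the standard one, not golang.org/x/net/html) and inherits every limitation of regex stripping.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

var tagRe = regexp.MustCompile(`<[^>]*>`)

// stripTagsSpaced is a hypothetical helper for small, trusted fragments only.
func stripTagsSpaced(s string) string {
	// A space instead of "" keeps neighbouring text apart.
	s = tagRe.ReplaceAllString(s, " ")
	// Decode entities such as &amp; and &lt;.
	s = html.UnescapeString(s)
	// Collapse runs of whitespace into single spaces.
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Println(stripTagsSpaced(`<p>Hello</p><p>World &amp; Go</p>`))
}
```

This prints Hello World & Go. The trade-off is that inline tags also become spaces, so Go<b>lang</b> turns into "Go lang"; telling block elements from inline ones needs a parser.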

Why regex is not reliable for real HTML

Nested tags, > inside attributes, SGML-style oddities, and executable regions inside <script> make regex stripping unsafe as a general strategy. Escalate to x/net/html or html2text when the input is arbitrary.
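Two of these failure modes are easy to reproduce. The sketch below applies the same <[^>]*> pattern to an attribute that contains > and to a script block; naiveStrip is a hypothetical name used only for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var tagRe = regexp.MustCompile(`<[^>]*>`)

// naiveStrip removes anything that looks like <...>.
func naiveStrip(s string) string {
	return tagRe.ReplaceAllString(s, "")
}

func main() {
	// The > inside the attribute ends the match early, so attribute debris leaks.
	fmt.Printf("%q\n", naiveStrip(`<a title="a > b">link</a>`))
	// The script tags are removed, but the script source stays in the "text".
	fmt.Printf("%q\n", naiveStrip(`<script>alert(1)</script>Hi`))
}
```

The first call leaves the fragment b">link (with a leading space) instead of link, and the second returns alert(1)Hi, so script source ends up in what was supposed to be reader-facing text.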

Parse HTML using golang.org/x/net/html

The golang.org/x/net/html package provides an HTML5 tokenizer and html.Parse, which builds an html.Node tree.

Approach	Best for
`html.NewTokenizer`	Streaming scan, collecting `TextToken` payloads
`html.Parse`	Finding `<body>`, `<title>`, walking children, rendering subtrees

Add the module to your project: go get golang.org/x/net/html.

Use tokenizer for text extraction

go


package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func textTokens(htmlInput string) []string {
	tkn := html.NewTokenizer(strings.NewReader(htmlInput))
	var out []string
	for {
		switch tkn.Next() {
		case html.ErrorToken:
			return out
		case html.TextToken:
			t := tkn.Token()
			if s := strings.TrimSpace(t.Data); s != "" {
				out = append(out, s)
			}
		}
	}
}

func main() {
	doc := `<!DOCTYPE html><html><head><title>Pets</title></head><body>
<p>A list of pets</p><ul><li>dog</li><li>cat</li></ul>
<footer>GoLinuxCloud</footer></body></html>`
	fmt.Println(textTokens(doc))
}

You should see a slice of non-empty fragments (including <title> text unless you filter by context). For production text, skip script, style, and head tokens or switch to a tree walk.

Use node parser when you need the body element

html.Parse returns the document root; descend to locate the <body> element and recurse only inside it for text or rendering.

Extract text from the HTML body tag

Typical flow: parse → find <body> → walk child nodes → collect html.TextNode data → normalize spaces.

Skip or strip script, style, and usually head content when building reader-facing text. Between block elements (p, div, br, li, headings), insert spaces or newlines so <p>Hello</p><p>World</p> does not become HelloWorld.

go


package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func findBody(n *html.Node) *html.Node {
	if n.Type == html.ElementNode && n.Data == "body" {
		return n
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		if b := findBody(c); b != nil {
			return b
		}
	}
	return nil
}

func textFromBody(body *html.Node) string {
	var b strings.Builder
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		block := false
		switch n.Type {
		case html.TextNode:
			b.WriteString(n.Data)
		case html.ElementNode:
			switch n.Data {
			case "script", "style":
				return // never reader-facing text
			case "p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6":
				block = true
				b.WriteByte(' ') // separator before the block
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
		if block {
			// Separator after the block, so <p>Hello</p>World is not glued.
			b.WriteByte(' ')
		}
	}
	walk(body)
	return strings.Join(strings.Fields(b.String()), " ")
}

func main() {
	doc := `<!DOCTYPE html><html><head><title>X</title></head><body><p>Hello</p><p>World</p></body></html>`
	root, err := html.Parse(strings.NewReader(doc))
	if err != nil {
		panic(err)
	}
	body := findBody(root)
	if body == nil {
		panic("no body")
	}
	fmt.Println(textFromBody(body))
}

After adding the dependency with go get golang.org/x/net/html, the program prints Hello World, with a separating space instead of HelloWorld.

Extract the HTML body as a string

This is not the same as plain text: you keep tags and serialize a subtree.

Requirement	Typical output
Text inside `<body>`	Plain string
Inner body HTML	Serialized children of `<body>`
Full `<body>…</body>`	Include the body element in rendering

html.Render writes a *html.Node subtree to an io.Writer. Render the <body> node itself to include the outer tags, or iterate over its children (FirstChild, then NextSibling) and render each one if you want inner HTML only.

go


package main

import (
	"bytes"
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func findBody(n *html.Node) *html.Node {
	if n.Type == html.ElementNode && n.Data == "body" {
		return n
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		if b := findBody(c); b != nil {
			return b
		}
	}
	return nil
}

func main() {
	doc := `<!DOCTYPE html><html><head></head><body><p id="x">Hi</p></body></html>`
	root, err := html.Parse(strings.NewReader(doc))
	if err != nil {
		panic(err)
	}
	body := findBody(root)
	var buf bytes.Buffer
	if err := html.Render(&buf, body); err != nil {
		panic(err)
	}
	fmt.Println(buf.String())
}

You should get a string starting with <body> that still contains the <p> markup—suitable for storage or further processing, not for “human plain text.”

Use an html2text package

When you want emails, CMS HTML, or scraped pages turned into readable plain text with entities, links, and paragraphs handled consistently, a small third-party converter often beats hand-rolled walks.

github.com/k3a/html2text is one option: go get github.com/k3a/html2text@latest, then call HTML2Text on your HTML string inside a module.

go


package main

import (
	"fmt"

	"github.com/k3a/html2text"
)

func main() {
	doc := `<!DOCTYPE html><html><head><title>Pets</title></head><body>
<p>A list of pets</p><ul><li>dog</li><li>cat</li></ul>
<footer>GoLinuxCloud</footer></body></html>`
	fmt.Print(html2text.HTML2Text(doc))
}

Output layout depends on the library version; expect formatted plain text rather than a single joined line.

Sanitize HTML vs strip HTML tags

Task	Purpose
Strip / HTML → text	Produce plain text for logs, search, ML, etc.
Sanitize	Keep safe HTML for rendering in a browser

For untrusted HTML that will be injected into a page, use a sanitizer such as bluemonday, not regex tag removal. Stripping tags to get plain text does not make any markup that survives safe to assign to innerHTML.
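A small illustration of why stripped output is not automatically safe: an unterminated tag never matches <[^>]*>, so it passes through the strip unchanged. The sketch below assumes the output is meant to be shown as text, in which case escaping with the standard library's html.EscapeString neutralizes it. The helper names are hypothetical, and escaping is shown here as an aside for text-only output; markup that must be kept still needs a policy-based sanitizer.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
)

var tagRe = regexp.MustCompile(`<[^>]*>`)

// naiveStrip removes anything that looks like <...>.
func naiveStrip(s string) string {
	return tagRe.ReplaceAllString(s, "")
}

// escapeForHTML escapes text so a browser displays it instead of parsing it.
func escapeForHTML(s string) string {
	return html.EscapeString(s)
}

func main() {
	// No closing >, so the regex never matches and nothing is removed.
	in := `<img src=x onerror=alert(1)`
	stripped := naiveStrip(in)
	fmt.Println(stripped == in)
	fmt.Println(escapeForHTML(stripped))
}
```

The first line printed is true, because the strip removed nothing. The second is &lt;img src=x onerror=alert(1), which a browser shows as literal text.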

Common mistakes

Using regex for complex HTML

Prefer x/net/html or html2text for arbitrary documents.

Forgetting to close response.Body

Use defer resp.Body.Close() after checking the error from Get / Do.

Reading very large response bodies into memory

io.ReadAll buffers everything; stream or cap size for big files.
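Streaming can be sketched with io.Copy, which moves data through a fixed-size buffer instead of holding the whole body. saveStream and sample are hypothetical names, and an in-memory reader stands in for resp.Body so the example runs offline.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// saveStream is a hypothetical helper: it copies r into a new temporary
// file and returns the file path and the number of bytes written.
func saveStream(r io.Reader) (path string, n int64, err error) {
	f, err := os.CreateTemp("", "body-*.html")
	if err != nil {
		return "", 0, err
	}
	n, err = io.Copy(f, r)
	// Report a Close error only if the copy itself succeeded.
	if cerr := f.Close(); err == nil {
		err = cerr
	}
	return f.Name(), n, err
}

// sample streams a fixed snippet; in a real client r would be resp.Body.
func sample() (int64, error) {
	path, n, err := saveStream(strings.NewReader("<p>Hello <b>Go</b></p>"))
	if path != "" {
		defer os.Remove(path) // clean up the demo file
	}
	return n, err
}

func main() {
	n, err := sample()
	if err != nil {
		panic(err)
	}
	fmt.Println("bytes written:", n)
}
```

The snippet is 22 bytes long, so the program reports 22 bytes written. Memory use stays flat regardless of body size because only the copy buffer is resident.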

Thinking `string(body)` means plain text

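string(body) copies the bytes into a Go string without decoding or validating the character encoding, so a page served in another charset is not converted to UTF-8. HTML tags remain as well.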
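The encoding point can be checked with unicode/utf8. The sketch below assumes a body encoded as ISO-8859-1 and uses a hypothetical helper, latin1ToUTF8, that relies on each Latin-1 byte value being equal to its Unicode code point. Real pages need charset detection from the Content-Type header or a meta tag; golang.org/x/net/html/charset is one package aimed at that and is not shown here.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// latin1ToUTF8 converts ISO-8859-1 bytes to a UTF-8 Go string. Each Latin-1
// byte value equals its Unicode code point, so a per-byte conversion works.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// "café" encoded as ISO-8859-1: the é is the single byte 0xE9.
	raw := []byte{'c', 'a', 'f', 0xE9}

	// The bytes are not valid UTF-8, and string(raw) does not change that.
	fmt.Println(utf8.Valid(raw))
	fmt.Println(utf8.ValidString(string(raw)))

	// An explicit conversion produces the intended text.
	fmt.Println(latin1ToUTF8(raw))
}
```

Both validity checks print false, and the explicit conversion prints café.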

Losing spaces between elements

Naive stripping or text concatenation can join words; add separators around block elements or use a library.

Confusing sanitization with text extraction

Sanitization is for safe HTML output; text extraction targets plain output.

Go HTML to text cheat sheet

Goal	Approach
Response body → `string`	`defer resp.Body.Close(); io.ReadAll(resp.Body)` then `string(b)`
Tiny trusted fragment, rough strip	Regex replace (limited)
Real HTML → text	`x/net/html` tokenizer or tree walk, or html2text
Text only inside `<body>`	`html.Parse`, find `body`, walk text nodes (skip `script`/`style`)
Serialize `<body>` subtree to HTML	`html.Render` on the body node
Readable email / article text	html2text-style package
Safe HTML for browser	bluemonday (policy-based)
Large response	Stream; avoid unbounded `ReadAll`

Which approach should you use?

You are trying to…	Start here
Log or store raw page HTML	HTTP `ReadAll` + `string`
Quick lab strip on known HTML	Regex (then upgrade if inputs grow)
Reliable text or body extraction	`golang.org/x/net/html`
Nice plain text with layout	html2text module
User HTML shown on a site	bluemonday, not regex

Summary

Reading an HTTP body gives a raw string that may still be full HTML. Stripping tags with regex is a narrow tool; parsing with golang.org/x/net/html lets you target the body, choose between collecting text and rendering HTML, and skip script/style. A third-party html2text package helps when readability matters; bluemonday addresses XSS when HTML is rendered, which is a separate concern from plain-text extraction. Pick the simplest step that matches how much you trust the input and what output you need, then add streaming or sanitization when requirements grow.
