This page ties together tasks people mix up: turning an HTTP response body into a Go string, stripping tags for rough plain text, parsing HTML to find the <body>, and optionally serializing body markup again. Those are different steps in the same pipeline. For regular-expression basics used in tiny strippers, see regular expressions in Go.
Tested on: Go 1.22, 64-bit Linux. Snippets that import golang.org/x/net/html or third-party modules, or that perform real HTTP requests, are marked {run=false} and are meant to run inside a small module on your machine (go mod init, then go get as needed).
Quick answer: body bytes, raw HTML string, then meaning
To put an HTTP response body in a string, read resp.Body with io.ReadAll, handle errors, and convert with string(body). That string still contains markup if the server returned HTML.
To produce plain text from HTML, parse with golang.org/x/net/html (tokenizer or html.Parse) or use a dedicated html2text library. Regex can strip simple tags on small, trusted fragments but is not a general HTML solution.
Convert HTML to text in Go
Start with what you actually need:
| Goal | What it means |
|---|---|
| Response body → string | Bytes from the wire as a Go string (often still HTML) |
| Strip HTML tags | Remove <…> segments to approximate plain text |
| HTML → readable text | Decode entities, spacing, lists, links—usually needs a parser or library |
| Body text | Plain text taken from nodes inside <body> |
| Body HTML | Markup inside or including <body> serialized back to a string |
| Sanitize HTML | Allow safe tags/attrs for browser output—different from “text only” |
Example: for <p>Hello <b>Go</b></p>, plain text might be Hello Go (with spaces handled by your extractor). Body HTML might be the inner <p>…</p> or the full <body>…</body> depending on what you serialize.
When you only need plain text
Prefer a tokenizer walk, a tree walk after html.Parse, or an html2text package when layout and entities matter.
When you should parse HTML instead of using regex
HTML is not a regular language. Malformed tags, attributes containing >, comments, CDATA, and <script> / <style> blocks break naive patterns. Use x/net/html or a library for real pages.
Read HTTP response body as string
This matches searches such as golang http body to string and golang response body to string: you only move bytes into memory and cast to string. Tags remain until you strip or parse. For building HTTP clients and handlers, see Golang HTTP.
Use io.ReadAll(resp.Body) after a successful Get or Do, check errors, then convert with string(body). Always defer resp.Body.Close() once the error check passes; the transport can reuse the connection only after the body has been read to EOF and closed. io.ReadAll loads the entire body into memory; for large downloads, stream with io.Copy to a file or cap the size with io.LimitReader.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get("https://example.com/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}

	html := string(body)
	prefix := strings.TrimSpace(html)
	if len(prefix) > 60 {
		prefix = prefix[:60] + "…"
	}
	fmt.Println("bytes:", len(body), "prefix:", prefix)
}

Run locally with network access; you should see a byte count and a short ASCII prefix of the document. The string still includes <!DOCTYPE, tags, and entities; reading the body is not HTML-to-text by itself.
Strip HTML tags from a string
Simple tag removal for trusted small input
A pattern such as <[^>]*> with regexp.ReplaceAllString can strip angle-bracket chunks on tight, trusted snippets. Treat the result as best-effort.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	re := regexp.MustCompile(`<[^>]*>`)
	html := `<div><h1>GoLinuxCloud</h1><p>This is an html document!</p></div>`
	plain := re.ReplaceAllString(html, "")
	fmt.Println(plain)
}

This prints GoLinuxCloudThis is an html document! with the tags removed. Note that adjacent text is glued together (Hello</p><p>World becomes HelloWorld) unless you insert spaces or use a parser-aware extractor.
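One way to avoid the glued-together words is to replace each tag with a space and then collapse whitespace. The sketch below uses only the standard library; the helper name stripTagsSpaced is an illustrative assumption, and it stays a best-effort tool for small, trusted fragments.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

var tagRE = regexp.MustCompile(`<[^>]*>`)

// stripTagsSpaced replaces each tag with a space, decodes entities such as
// &amp;, and collapses runs of whitespace into single spaces.
func stripTagsSpaced(s string) string {
	s = tagRE.ReplaceAllString(s, " ")
	s = html.UnescapeString(s)
	return strings.Join(strings.Fields(s), " ")
}

func main() {
	fmt.Println(stripTagsSpaced(`<p>Hello</p><p>World &amp; Go</p>`)) // Hello World & Go
	fmt.Println(stripTagsSpaced(`Hello <b>Go</b>!`))                  // Hello Go !
}
```

The second call shows the trade-off: inline tags also become spaces, so punctuation after a closing inline tag picks up a stray space. Separating only around block elements needs a parser.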
Why regex is not reliable for real HTML
Nested tags, > inside attributes, SGML-style oddities, and executable regions inside <script> make regex stripping unsafe as a general strategy. Escalate to x/net/html or html2text when the input is arbitrary.
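A few of these failure modes are easy to reproduce. The sketch below applies the same naive pattern to three small inputs; the helper name naiveStrip and the sample strings are illustrative assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

var naiveTagRE = regexp.MustCompile(`<[^>]*>`)

// naiveStrip removes every <...> chunk without understanding HTML.
func naiveStrip(s string) string {
	return naiveTagRE.ReplaceAllString(s, "")
}

func main() {
	// The > inside the attribute value ends the match early,
	// so part of the attribute leaks into the text.
	fmt.Printf("[%s]\n", naiveStrip(`<a title="a > b">link</a>`))

	// The script tags are removed, but the script source survives as "text".
	fmt.Printf("[%s]\n", naiveStrip(`<script>var x = 1;</script>ok`))

	// A > inside a comment cuts the comment in half.
	fmt.Printf("[%s]\n", naiveStrip(`<!-- a > b -->text`))
}
```

None of these inputs is malformed HTML, which is why the pattern should stay limited to fragments whose shape you control.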
Parse HTML using golang.org/x/net/html
The golang.org/x/net/html package, maintained by the Go project outside the standard library, provides an HTML5 tokenizer and html.Parse, which returns an *html.Node tree.
| Approach | Best for |
|---|---|
| html.NewTokenizer | Streaming scan, collecting TextToken payloads |
| html.Parse | Finding <body>, <title>, walking children, rendering subtrees |
Add the module to your project: go get golang.org/x/net/html.
Use tokenizer for text extraction
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func textTokens(htmlInput string) []string {
	tkn := html.NewTokenizer(strings.NewReader(htmlInput))
	var out []string
	for {
		switch tkn.Next() {
		case html.ErrorToken:
			return out
		case html.TextToken:
			t := tkn.Token()
			if s := strings.TrimSpace(t.Data); s != "" {
				out = append(out, s)
			}
		}
	}
}

func main() {
	doc := `<!DOCTYPE html><html><head><title>Pets</title></head><body>
<p>A list of pets</p><ul><li>dog</li><li>cat</li></ul>
<footer>GoLinuxCloud</footer></body></html>`
	fmt.Println(textTokens(doc))
}

You should see a slice of non-empty fragments (including <title> text unless you filter by context). For production text, skip script, style, and head tokens or switch to a tree walk.
Use node parser when you need the body element
html.Parse returns the document root; descend to locate the <body> element and recurse only inside it for text or rendering.
Extract text from the HTML body tag
Typical flow: parse → find <body> → walk child nodes → collect html.TextNode data → normalize spaces.
Skip or strip script, style, and usually head content when building reader-facing text. Between block elements (p, div, br, li, headings), insert spaces or newlines so <p>Hello</p><p>World</p> does not become HelloWorld.
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func findBody(n *html.Node) *html.Node {
	if n.Type == html.ElementNode && n.Data == "body" {
		return n
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		if b := findBody(c); b != nil {
			return b
		}
	}
	return nil
}

func textFromBody(body *html.Node) string {
	var b strings.Builder
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		switch n.Type {
		case html.TextNode:
			b.WriteString(n.Data)
		case html.ElementNode:
			switch n.Data {
			case "script", "style":
				return
			case "p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6":
				if b.Len() > 0 {
					b.WriteByte(' ')
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(body)
	return strings.Join(strings.Fields(b.String()), " ")
}

func main() {
	doc := `<!DOCTYPE html><html><head><title>X</title></head><body><p>Hello</p><p>World</p></body></html>`
	root, err := html.Parse(strings.NewReader(doc))
	if err != nil {
		panic(err)
	}
	body := findBody(root)
	if body == nil {
		panic("no body")
	}
	fmt.Println(textFromBody(body))
}

Run this after go get golang.org/x/net/html; you should see Hello World with a separating space instead of HelloWorld.
Extract the HTML body as a string
This is not the same as plain text: you keep tags and serialize a subtree.
| Requirement | Typical output |
|---|---|
| Text inside <body> | Plain string |
| Inner body HTML | Serialized children of <body> |
| Full <body>…</body> | Include the body element in rendering |
html.Render writes a *html.Node subtree to an io.Writer. Render the <body> node itself to include the outer <body> tags, or iterate from body.FirstChild through each NextSibling and render every child if you want inner HTML only.
package main

import (
	"bytes"
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

func findBody(n *html.Node) *html.Node {
	if n.Type == html.ElementNode && n.Data == "body" {
		return n
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		if b := findBody(c); b != nil {
			return b
		}
	}
	return nil
}

func main() {
	doc := `<!DOCTYPE html><html><head></head><body><p id="x">Hi</p></body></html>`
	root, err := html.Parse(strings.NewReader(doc))
	if err != nil {
		panic(err)
	}
	body := findBody(root)
	if body == nil {
		panic("no body")
	}
	var buf bytes.Buffer
	if err := html.Render(&buf, body); err != nil {
		panic(err)
	}
	fmt.Println(buf.String())
}

You should get a string starting with <body> that still contains the <p> markup. That is suitable for storage or further processing, not for “human plain text.”
Use an html2text package
When you want emails, CMS HTML, or scraped pages turned into readable plain text with entities, links, and paragraphs handled consistently, a small third-party converter often beats hand-rolled walks.
github.com/k3a/html2text is one option: go get github.com/k3a/html2text@latest, then call HTML2Text on your HTML string inside a module.
package main

import (
	"fmt"

	"github.com/k3a/html2text"
)

func main() {
	doc := `<!DOCTYPE html><html><head><title>Pets</title></head><body>
<p>A list of pets</p><ul><li>dog</li><li>cat</li></ul>
<footer>GoLinuxCloud</footer></body></html>`
	fmt.Print(html2text.HTML2Text(doc))
}

Output layout depends on the library version; expect formatted plain text rather than a single joined line.
Sanitize HTML vs strip HTML tags
| Task | Purpose |
|---|---|
| Strip / HTML → text | Produce plain text for logs, search, ML, etc. |
| Sanitize | Keep safe HTML for rendering in a browser |
For untrusted HTML that will be injected into a page, use a sanitizer such as bluemonday, not regex tag removal. Stripping tags to get plain text does not make any leftover markup safe to assign to innerHTML.
Common mistakes
Using regex for complex HTML
Prefer x/net/html or html2text for arbitrary documents.
Forgetting to close response.Body
Use defer resp.Body.Close() after checking the error from Get / Do.
Reading very large response bodies into memory
io.ReadAll buffers everything; stream or cap size for big files.
Thinking string(body) means plain text
string(body) copies the bytes into a Go string as they are; it does not validate UTF-8 or convert other character encodings, and the HTML tags remain.
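A small check makes the encoding point concrete. In this sketch the helper name describe and the sample bytes are illustrative assumptions: the conversion keeps the byte count unchanged, and utf8.Valid reports whether the bytes happen to be valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe converts body to a string and reports its length in bytes and
// whether the content is valid UTF-8. The conversion itself never fails.
func describe(body []byte) (int, bool) {
	s := string(body)
	return len(s), utf8.ValidString(s)
}

func main() {
	latin1 := []byte{'c', 'a', 'f', 0xE9} // "café" encoded as ISO-8859-1
	fmt.Println(describe(latin1))

	utf8Bytes := []byte("café") // the same word encoded as UTF-8
	fmt.Println(describe(utf8Bytes))
}
```

If a server sends a non-UTF-8 charset, convert it before parsing; the golang.org/x/net/html/charset package exists for that purpose.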
Losing spaces between elements
Naive stripping or text concatenation can join words; add separators around block elements or use a library.
Confusing sanitization with text extraction
Sanitization is for safe HTML output; text extraction targets plain output.
Go HTML to text cheat sheet
| Goal | Approach |
|---|---|
| Response body → string | defer resp.Body.Close(); io.ReadAll(resp.Body) then string(b) |
| Tiny trusted fragment, rough strip | Regex replace (limited) |
| Real HTML → text | x/net/html tokenizer or tree walk, or html2text |
| Text only inside <body> | html.Parse, find body, walk text nodes (skip script/style) |
| Serialize <body> subtree to HTML | html.Render on the body node |
| Readable email / article text | html2text-style package |
| Safe HTML for browser | bluemonday (policy-based) |
| Large response | Stream; avoid unbounded ReadAll |
Which approach should you use?
| You are trying to… | Start here |
|---|---|
| Log or store raw page HTML | HTTP ReadAll + string |
| Quick lab strip on known HTML | Regex (then upgrade if inputs grow) |
| Reliable text or body extraction | golang.org/x/net/html |
| Nice plain text with layout | html2text module |
| User HTML shown on a site | bluemonday, not regex |
Summary
Reading an HTTP body gives a raw string that may still be full HTML. Stripping tags with regex is a narrow tool; parsing with golang.org/x/net/html lets you target the body, collect text vs render HTML, and skip script/style. Third-party html2text helps when readability matters; bluemonday addresses XSS when HTML is rendered, which is unrelated to plain-text extraction. Pick the shallowest step that matches input trust and output shape, then add streaming or sanitization when requirements grow.

