Skip to main content
blog title image

1 minute read - JSoup Java For Testers

JSoup Tip How to get raw element text with newlines in Java - Parsing HTML and XML with JSoup

Apr 13, 2017

TL;DR with JSoup either switch off document pretty printing or use textNodes to pull the raw text from an element.


thumbnail of video

A quick tip for JSoup.

I wanted to pull out the raw text from an HTML element and retain the \n newline characters. But HTML doesn’t care about those so JSOUP normally parses them away.

I found two ways to access them.

  • switching off pretty printing
  • using the textNodes

Switching off Pretty Printing

When you parse a document in JSoup you can switch off the prettyPrint

Document doc = Jsoup.parse(filename, "UTF-8", "http://example.com/");
doc.outputSettings().prettyPrint(false);

Then when you access the html or other text in an element you can find all the \n characters in the text.

String textA = element.html();

Use the textNodes

This approach works regardless of whether you have prettyPrint on or off:

String text = "";
for(TextNode node : element.textNodes()){
    text = text + node + "\n\n";
}

If you accidentally use both methods then you might get confused.

I think I prefer the second approach because it works regardless.

You can find code that illustrates this on github in the TwineSugarCubeReader.java file

See also the accompanying YouTube Video: