Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
481 views
in Technique[技术] by (71.8m points)

javascript - What is the best way to parse html in google apps script

var page = UrlFetchApp.fetch(contestURL);
var doc = XmlService.parse(page);

The above code gives a parse error when used, however if I replace the XmlService class with the deprecated Xml class, with the lenient flag set, it parses the html properly.

var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);

The problem is mostly caused because of no CDATA in the javascript part of the html and the parser complains with the following error.

The entity name must immediately follow the '&' in the entity reference.

Even if I remove all the <script>(.*?)</script> using regex, it still complains because the <br> tags aren't closed. Is there a clean way of parsing html into a DOM tree.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

I ran into this exact same problem. I was able to circumvent it by first using the deprecated Xml.parse, since it still works, then selecting the body XmlElement, then passing in its Xml String into the new XmlService.parse method:

var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
var bodyHtml = doc.html.body.toXmlString();
doc = XmlService.parse(bodyHtml);
var root = doc.getRootElement();

Note: This solution may not work if the old Xml.parse is completely removed from Google Scripts.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to WuJiGu Developer Q&A Community for programmer and developer-Open, Learning and Share
...