<dependency>
  <!-- jsoup HTML parser library @ http://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.10.2</version>
</dependency>

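Main classes in a JSoup application

Although the full library contains many classes, in most cases the three classes below are the ones you need to know well.

1. The org.jsoup.Jsoup class

The Jsoup class is the entry point of any Jsoup program. It provides methods for loading and parsing HTML documents from a variety of sources.

Some important methods of the Jsoup class are:

Method	Description
`static Connection connect(String url)`	Creates and returns a connection to the URL.
`static Document parse(File in, String charsetName)`	Parses the contents of a file, read with the specified charset, into a Document.
`static Document parse(String html)`	Parses the given HTML string into a Document.
`static String clean(String bodyHtml, Whitelist whitelist)`	Returns safe HTML from the input HTML by parsing it and filtering it through a whitelist of permitted tags and attributes.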

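2. The org.jsoup.nodes.Document class

This class represents an HTML document loaded through the Jsoup library. Use it to perform operations that apply to the HTML document as a whole.

For the important methods of the Document class, see http://jsoup.org/apidocs/org/jsoup/nodes/Document.html .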
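As a minimal sketch of document-wide operations, the example below parses an inline HTML string, reads the body text, and changes the title. The class name DocumentSketch, the helpers bodyText and titleAfterRename, and the sample HTML are illustrative assumptions and are not part of the Jsoup library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DocumentSketch {

    // Hypothetical helper: return the visible text of the document's <body>.
    static String bodyText(String html) {
        return Jsoup.parse(html).body().text();
    }

    // Hypothetical helper: set a new <title> and read it back.
    static String titleAfterRename(String html, String newTitle) {
        Document document = Jsoup.parse(html);
        document.title(newTitle);   // updates the text of the <title> element
        return document.title();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Sample page</title></head>"
                    + "<body><p>Hello, jsoup.</p></body></html>";
        System.out.println(Jsoup.parse(html).title());
        System.out.println(bodyText(html));
        System.out.println(titleAfterRename(html, "Renamed page"));
    }
}
```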

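3. The org.jsoup.nodes.Element class

An HTML element consists of a tag name, attributes, and child nodes. With the Element class you can extract data, traverse nodes, and manipulate the HTML.

For the important methods of the Element class, see http://jsoup.org/apidocs/org/jsoup/nodes/Element.html .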
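The three uses of Element mentioned above (extracting data, traversing child nodes, and manipulating the HTML) can be sketched roughly as follows. The class name ElementSketch, the helper summarize, the id "intro", and the sample markup are assumptions made for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementSketch {

    // Hypothetical helper: "tagName className childCount" for the element with the given id.
    static String summarize(String html, String id) {
        Document document = Jsoup.parse(html);
        Element element = document.getElementById(id);
        if (element == null) {
            return "not found";   // getElementById returns null when nothing matches
        }
        return element.tagName() + " " + element.className() + " " + element.children().size();
    }

    public static void main(String[] args) {
        String html = "<div id=\"intro\" class=\"box\"><p>One</p><p>Two</p></div>";
        System.out.println(summarize(html, "intro"));

        Element intro = Jsoup.parse(html).getElementById("intro");

        // Traverse the direct child elements.
        for (Element child : intro.children()) {
            System.out.println(child.tagName() + ": " + child.text());
        }

        // Manipulate the tree: append a new <p> and set its text.
        intro.appendElement("p").text("Three");
        System.out.println(intro.children().size());
    }
}
```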

Examples

Now let's look at some examples of processing HTML documents with the Jsoup API.

1. Load a document from a URL

Use the Jsoup.connect() method to load HTML from a URL.

try
{
    Document document = Jsoup.connect("http://www.yiibai.com").get();
    System.out.println(document.title());
} 
catch (IOException e) 
{
    e.printStackTrace();
}

2. Load a document from a file

Use the Jsoup.parse() method to load HTML from a file.

try
{
    Document document = Jsoup.parse( new File( "D:/temp/index.html" ) , "utf-8" );
    System.out.println(document.title());
} 
catch (IOException e) 
{
    e.printStackTrace();
}

3. Load a document from a String

Use the Jsoup.parse() method to load HTML from a string.

String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
// Jsoup.parse(String) performs no I/O and declares no checked exception,
// so no try/catch is needed (catching IOException here would not compile).
Document document = Jsoup.parse(html);
System.out.println(document.title());

4. Get the title from HTML

As in the examples above, call the document.title() method to get the title of an HTML page.

try
{
    Document document = Jsoup.parse( new File("C:/Users/xyz/Desktop/yiibai-index.html"), "utf-8");
    System.out.println(document.title());
} 
catch (IOException e) 
{
    e.printStackTrace();
}

5. Get the favicon of an HTML page

Assuming the favicon is the first image referenced in the <head> section of the HTML document, you can use the code below.

String favImage = "Not Found";
try {
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");
    Element element = document.head().select("link[href~=.*\\.(ico|png)]").first();
    if (element == null) 
    {
        element = document.head().select("meta[itemprop=image]").first();
        if (element != null) 
        {
            favImage = element.attr("content");
        }
    } 
    else
    {
        favImage = element.attr("href");
    }
} 
catch (IOException e) 
{
    e.printStackTrace();
}
System.out.println(favImage);

6. Get all links in an HTML page

To get all the links in a web page, use the following code.

try
{
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");
    Elements links = document.select("a[href]");  
    for (Element link : links) 
    {
         System.out.println("link : " + link.attr("href"));  
         System.out.println("text : " + link.text());  
    }
} 
catch (IOException e) 
{
    e.printStackTrace();
}

7. Get all images in an HTML page

To get all the images displayed in a web page, use the following code.

try
{
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");
    Elements images = document.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
    for (Element image : images) 
    {
        System.out.println("src : " + image.attr("src"));
        System.out.println("height : " + image.attr("height"));
        System.out.println("width : " + image.attr("width"));
        System.out.println("alt : " + image.attr("alt"));
    }
} 
catch (IOException e) 
{
    e.printStackTrace();
}

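8. Get the meta information of a URL

Meta information is what search engines such as Google use to determine a page's content for indexing. It takes the form of tags in the HEAD section of the HTML page. To get the meta information of a web page, use the code below.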

try
{
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");

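    // Elements.attr() returns the value from the first matched element that has the attribute,
    // or an empty string if none does; get(0) or first() would fail on a page without the tag.
    String description = document.select("meta[name=description]").attr("content");
    System.out.println("Meta description : " + description);

    String keywords = document.select("meta[name=keywords]").attr("content");
    System.out.println("Meta keyword : " + keywords);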
} 
catch (IOException e) 
{
    e.printStackTrace();
}

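9. Get form attributes in an HTML page

Getting the input elements of a form in a web page is straightforward. Find the FORM element by its unique ID, then find all the INPUT elements inside that form.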

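// Jsoup.parse(File, String) throws IOException, so the enclosing method must declare or catch it.
Document doc = Jsoup.parse(new File("c:/temp/yiibai-index.html"), "utf-8");
Element formElement = doc.getElementById("loginForm");

// getElementById returns null when no element has that id.
if (formElement != null) {
    Elements inputElements = formElement.getElementsByTag("input");
    for (Element inputElement : inputElements) {
        String key = inputElement.attr("name");
        String value = inputElement.attr("value");
        System.out.println("Param name: " + key + " \nParam value: " + value);
    }
}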

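10. Update the attributes/content of elements

Once you have found the elements you want using the approaches above, you can use the Jsoup API to update their attributes or innerHTML. For example, suppose you want to set rel="nofollow" on every link in the document.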

try
{
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai.com.html"), "utf-8");
    Elements links = document.select("a[href]");  
    links.attr("rel", "nofollow");
} 
catch (IOException e) 
{
    e.printStackTrace();
}
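To make the effect of such an update visible without a local file, here is a small sketch that works on an inline HTML string and reads the changed values back. The class name UpdateSketch, both helper methods, and the sample markup are hypothetical names chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class UpdateSketch {

    // Hypothetical helper: mark every link as nofollow, then read rel back from the first link.
    static String relAfterUpdate(String html) {
        Document document = Jsoup.parse(html);
        Elements links = document.select("a[href]");
        links.attr("rel", "nofollow");   // sets the attribute on every matched element
        return links.attr("rel");        // value from the first match, or "" if there is none
    }

    // Hypothetical helper: replace the inner HTML of the first link and read it back.
    static String innerHtmlAfterUpdate(String html, String newInnerHtml) {
        Document document = Jsoup.parse(html);
        Element link = document.select("a[href]").first();
        if (link == null) {
            return "";
        }
        link.html(newInnerHtml);
        return link.html();
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/\">Home</a></p>";
        System.out.println(relAfterUpdate(html));
        System.out.println(innerHtmlAfterUpdate(html, "<b>Home</b>"));
    }
}
```

These changes apply only to the in-memory Document; to keep them, serialize the document (for example with document.outerHtml()) and write the result wherever it is needed.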

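11. Sanitize untrusted HTML (to prevent XSS)

Suppose your application displays HTML fragments submitted by users, for example HTML typed into a comment box. Displaying that HTML directly can cause very serious problems: a user can embed malicious scripts that, for instance, redirect other users to a malicious site.

To clean such HTML, Jsoup provides the Jsoup.clean() method. It takes a string of HTML and returns cleaned HTML. To do this, Jsoup uses a whitelist filter. The jsoup whitelist filter works by parsing the input HTML (in a safe, sandboxed environment) and then walking the parse tree, letting only known-safe tags and attributes (and values) through into the cleaned output.

It does not use regular expressions, which are unsuitable for this task.

The cleaner is useful not only for avoiding XSS but also for limiting the range of elements a user can supply: you might accept text and strong elements, but not structural div or table elements.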

String dirtyHTML = "<p><a href='http://www.yiibai.com/' onclick='sendCookiesToMe()'>Link</a></p>";

String cleanHTML = Jsoup.clean(dirtyHTML, Whitelist.basic());

System.out.println(cleanHTML);

Running this produces the following output:

<p><a href="http://www.yiibai.com/" rel="nofollow">Link</a></p>
