How can I modify this HTML page with Python?

Original page HTML:

<html xmlns="http://www.w3.org/1999/xhtml">
 <head> 
  <meta charset="utf-8" /> 
  <meta content="pdf2htmlEX" name="generator" /> 
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible" /> 
  <title></title> 
 </head> 
 <body>
  <div class="t m0 x0 h3 y45e ff2 fs1 fc1 sc0 ls0 ws102">
   abcd
   <span class="ws0">adc</span>
  </div>
  <div class="t m0 x0 h3 y45f ff2 fs1 fc1 sc0 ls0 wse">
   ab
  </div>
  <div class="t m0 xd5 hb y4be ff2 fs3 fc1 sc0 ls7 wse3">
   SUP
   <span class="_ _93"> </span>
   OUT
   <span class="_ _a1"> </span>
   OUT
  </div>
  <div class="t m0 xff h3 y4c1 ff2 fs1 fc1 sc0 ls5c ws10b">
   (V
   <span class="_ _54"> </span>
   V
   <span class="_ _a0">b<span class="_ _92">aa</span></span>
   V
  </div>
 </body>
</html>

I want a Python program that turns the HTML page into the following:

<html xmlns="http://www.w3.org/1999/xhtml">
 <head> 
  <meta charset="utf-8" /> 
  <meta content="pdf2htmlEX" name="generator" /> 
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible" /> 
  <title></title> 
 </head> 
 <body>
  <div class="t m0 x0 h3 y45e ff2 fs1 fc1 sc0 ls0 ws102">
   4
   <span class="ws0">3</span>
  </div>
  <div class="t m0 x0 h3 y45f ff2 fs1 fc1 sc0 ls0 wse">
   2
  </div>
  <div class="t m0 xd5 hb y4be ff2 fs3 fc1 sc0 ls7 wse3">
   3
   <span class="_ _93">1</span>
   3
   <span class="_ _a1">1</span>
   3
  </div>
  <div class="t m0 xff h3 y4c1 ff2 fs1 fc1 sc0 ls5c ws10b">
   2
   <span class="_ _54">1</span>
   1
   <span class="_ _a0">1<span class="_ _92">2</span></span>
   1
  </div>
 </body>
</html>

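Comparing the two pages, you can see that the text inside every tag has to be replaced with the number of characters in that text, while the original DOM structure and tag attributes must stay completely unchanged. Finally, the result has to be saved as a new page.

I can't get this to work with BeautifulSoup no matter what I try. Is this requirement just too unusual? Any help would be appreciated. (The page above is only an example; the real page has a much more deeply nested DOM, so hard-coding is pointless.)

Answers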

咕嚕嚕

Find an HTML parser, locate the text nodes in the parsed structure, and replace each one with the length of its text.
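
One way to follow this suggestion, sketched with Python's standard-library `html.parser` rather than any particular third-party parser: every tag is echoed back exactly as it appeared in the source, and only the text runs are swapped for their stripped length. The names `TextLengthRewriter` and `rewrite` are made up for illustration, and leaving whitespace-only text untouched is an assumption about how the indentation should be treated.

```python
from html.parser import HTMLParser


class TextLengthRewriter(HTMLParser):
    """Re-emit the markup unchanged, replacing each text run with its length."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as separate events.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written in the source.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are echoed once, not as open + close.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            # Keep the surrounding whitespace, replace only the visible text.
            data = data.replace(stripped, str(len(stripped)), 1)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&{};".format(name))

    def handle_charref(self, name):
        self.out.append("&#{};".format(name))

    def handle_comment(self, data):
        self.out.append("<!--{}-->".format(data))

    def handle_decl(self, decl):
        self.out.append("<!{}>".format(decl))


def rewrite(html):
    """Return the page source with every non-blank text run replaced by its length."""
    parser = TextLengthRewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `rewrite(page_source)` and writing the returned string to a new file would cover the "save as a new page" step. Text that contains entities such as `&amp;` arrives in several pieces, so each piece would be counted separately; this sketch does not merge them.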

December 5, 2017, 00:30

情皺

I'd suggest JavaScript; it couldn't be simpler.

Open the browser's developer tools (F12) and paste the code below into the console.

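function walk(node, fn) {
    if (node) do {
        fn(node);
        walk(node.firstChild, fn);
    } while (node = node.nextSibling);
}

walk(document.body, function (node) {
    // Only text nodes (nodeType 3) carry a nodeValue; for elements it is null.
    if (node.nodeType === Node.TEXT_NODE) {
        var text = node.nodeValue.trim();
        // Leave whitespace-only nodes (indentation between tags) untouched.
        if (text) {
            console.log(text);
            // Swap the trimmed text for its length, keeping the surrounding whitespace.
            node.nodeValue = node.nodeValue.replace(text, String(text.length));
        }
    }
});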

Here is a screenshot as proof:

[image]

June 3, 2018, 10:29

失魂人

Parse the page recursively and then rebuild it, using lxml: http://lxml.de/

December 21, 2017, 05:26

負(fù)我心

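import re

# Sample input for illustration; in practice, read the page source from a file.
html = '<div class="t m0">\n abcd\n <span class="ws0">adc</span>\n</div>'

def f(m):
    s = m.group(1)
    stripped = s.strip()
    if not stripped:
        # Whitespace-only text between tags is left untouched.
        return '>{}<'.format(s)
    # Replace the stripped text with its length, keeping the surrounding whitespace.
    return '>{}<'.format(s.replace(stripped, str(len(stripped)), 1))

# Non-greedy match of everything between a '>' and the next '<'.
p = re.compile(r'>(.*?)<', re.S)
print(p.sub(f, html))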

April 1, 2017, 09:40

艷骨

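import re

# Read the original page; the sample page declares utf-8 in its <meta> tag.
with open('1.html', 'r', encoding='utf-8') as r:
    txt = r.read()

print(txt)  # your original HTML text


def replace(match):
    # t is the raw text between two tags; s is the same text without surrounding whitespace.
    t, s = match.group(1), match.group(1).strip()
    # Swap the stripped text for its length; whitespace-only runs stay as they are.
    return '>%s<' % (t.replace(s, str(len(s))) if s else t)


# [\s\S] matches any character, including newlines.
txt1 = re.sub(r'>([\s\S]*?)<', replace, txt)

print(txt1)  # the converted HTML

# Save the result as a new page ('2.html' is only an example name).
with open('2.html', 'w', encoding='utf-8') as w:
    w.write(txt1)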

August 9, 2017, 22:06