
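Handling nested quotes in JSON with Python

Today, while writing a crawler for 煎蛋 (Jandan), I found that parsing the JSON returned for the tucao (comment replies) raised an error: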

>>> b=json.loads(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\hly20\AppData\Local\Programs\Python\Python37\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "C:\Users\hly20\AppData\Local\Programs\Python\Python37\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\hly20\AppData\Local\Programs\Python\Python37\lib\json\decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 198 (char 197)

It turned out that the JSON causing the error has double quotes nested inside double-quoted strings, as shown below:

{"code":0,"hot_tucao":[{"comment_ID":"5206422","comment_post_ID":"102312","comment_author":"不懂可以不说","comment_date":"2019-06-29 14:12:43","comment_date_int":1561788763,"comment_content":"  \u003ca href=\"#tucao-5206353\" data-id=\"5206353\" class=\"tucao-link\"\u003e@乐色分类\u003c/a\u003e 很简单。我喜欢她 = 我恋爱了。","comment_parent":"4286850","comment_reply_ID":"0","is_jandan_user":0,"is_tip_user":0,"vote_positive":"11","vote_negative":"0"}],"tucao":[{"comment_ID":"5206353","comment_post_ID":"102312","comment_author":"乐色分类","comment_date":"2019-06-29 14:01:29","comment_date_int":1561788089,"comment_content":"所以你是怎么定义谈恋爱的?看蛋友故事多了我不确定你们对谈恋爱的标准…","comment_parent":"4286850","comment_reply_ID":"0","is_jandan_user":0,"is_tip_user":0,"vote_positive":"4","vote_negative":"0"},{"comment_ID":"5206422","comment_post_ID":"102312","comment_author":"不懂可以不说","comment_date":"2019-06-29 14:12:43","comment_date_int":1561788763,"comment_content":"  \u003ca href=\"#tucao-5206353\" data-id=\"5206353\" class=\"tucao-link\"\u003e@乐色分类\u003c/a\u003e 很简单。我喜欢她 = 我恋爱了。","comment_parent":"4286850","comment_reply_ID":"0","is_jandan_user":0,"is_tip_user":0,"vote_positive":"12","vote_negative":"0"},{"comment_ID":"5206522","comment_post_ID":"102312","comment_author":"慕行秋","comment_date":"2019-06-29 14:31:29","comment_date_int":1561789889,"comment_content":"  \u003ca href=\"#tucao-5206353\" data-id=\"5206353\" class=\"tucao-link\"\u003e@乐色分类\u003c/a\u003e 众所周知,蛋友的恋爱==暗恋","comment_parent":"4286850","comment_reply_ID":"0","is_jandan_user":0,"is_tip_user":0,"vote_positive":"3","vote_negative":"0"},{"comment_ID":"5206613","comment_post_ID":"102312","comment_author":"不懂就要说不说怎么知道对错","comment_date":"2019-06-29 14:59:58","comment_date_int":1561791598,"comment_content":" \u003ca href=\"#tucao-5206353\" data-id=\"5206353\" class=\"tucao-link\"\u003e@乐色分类\u003c/a\u003e 因为不同的人对谈恋爱的定义不一样。\n有的人觉得谈恋爱重点在于谈(婚论嫁),有的人觉得在于恋爱","comment_parent":"4286850","comment_reply_ID":"0","is_jandan_user":0,"is_tip_user":0,"vote_positive":"0","vote_negative":"0"}],"has_next_page":false}
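
The failure can be reproduced on a much smaller input. The sketch below uses a shortened, made-up payload (the `broken` string is an assumption for illustration, not part of the real response) in which the quotes of an HTML attribute sit unescaped inside a string value. The parser treats the first inner quote as the end of the string and then complains about whatever follows.

```python
import json

# Hypothetical, shortened payload with the same defect: the HTML attribute
# quotes inside the string value are not escaped.
broken = '{"comment_content":"<a href="#tucao-1">@user</a> hi"}'

error = None
try:
    json.loads(broken)
except json.JSONDecodeError as exc:
    # The string value ends at the quote after href=, so the parser
    # expects a ',' or '}' next and finds '#' instead.
    error = exc
    print(exc.msg, "at char", exc.pos)
```

The `pos` attribute of the exception is the same character offset that the traceback reports as `char 197`, which makes it easy to slice the input around the offending quote.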

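I searched around. The post 使用Python处理json字符串中的非法双引号 ("Handling illegal double quotes in JSON strings with Python") seems to describe the same problem, but its solution does not work here: in that case the meaningful quotes appeared in only a few contexts, so the author simply enumerated them. So I took a different approach: for each ", look at the nearest non-space character on its left and on its right. If either one is among {}[]:, the quote is treated as structural; otherwise it is treated as an inner quote and a \ is inserted before it. The code: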

def clean_json(data):
    # Escape raw line breaks and tabs (this assumes they only occur inside
    # string values, as in the one-line response above)
    data = data.replace("\n", "\\n") \
               .replace("\r", "\\r") \
               .replace("\t", "\\t")
    # Decide for each quote whether it is structural
    start_index = 0
    origin_index = data.find('"', start_index)
    while origin_index >= 0:
        # By default, resume the search right after this quote
        start_index = origin_index + 1
        check_index = origin_index - 1
        while check_index >= 0:
            # Scan to the left
            if data[check_index] == ' ':
                # Skip spaces
                check_index -= 1
                continue
            elif data[check_index] in '{}:[],':
                # Structural quote: done with this one
                break
            else:
                # Nothing structural on the left, so scan to the right
                check_index = origin_index + 1
                while check_index < len(data):
                    if data[check_index] == ' ':
                        # Skip spaces
                        check_index += 1
                        continue
                    elif data[check_index] in '{}:[],':
                        # Structural quote: done with this one
                        break
                    else:
                        # Inner quote: insert a backslash before it
                        data = data[:origin_index] + '\\' + data[origin_index:]
                        start_index = origin_index + 2
                        break
                break
        origin_index = data.find('"', start_index)
    return data

The result (the cleaned string first, then the dict returned by json.loads):

{"code":0,"hot_tucao":[{"comment_ID":"5206422","comment_post_ID":"102312","comment_author":"不懂可以不说","comment_date":"2019-06-29 14:12:43","comment_date_int":1561788763,"comment_content":"  <a href=\"#tucao-5206353\" data-id=\"5206353\" class=\"tucao-link\">@乐色分类</a> 很简单。我喜欢她 = 我恋爱了。","comment_parent":"4286850","comment_reply_ID":"0","is_jandan_user":0,"is_tip_user":0,"vote_positive":"11","vote_negative":"0"}],"tucao":[{"comment_ID":"5206353","comment_post_ID":"102312","comment_author":"乐色分类","comment_date":"2019-06-29 14:01:29","comment_date_int":1561788089,"comment_content":"所以你是怎么定义谈恋爱的?看蛋友故事多了我不确定你们对谈恋爱的标准…","comment_parent":"4286850","comment_reply_ID":"0","is_jandan_user":0,"is_tip_user":0,"vote_positive":"4","vote_negative":"0"},{"comment_ID":"5206422","comment_post_ID":"102312","comment_author":"不懂可以不说","comment_date":"2019-06-29 14:12:43","comment_date_int":1561788763,"comment_content":"  <a href=\"#tucao-5206353\" data-id=\"5206353\" class=\"tucao-link\">@乐色分类</a> 很简单。我喜欢她 = 我恋爱了。","comment_parent":"4286850","comment_reply_ID":"0","is_jandan_user":0,"is_tip_user":0,"vote_positive":"12","vote_negative":"0"},{"comment_ID":"5206522","comment_post_ID":"102312","comment_author":"慕行秋","comment_date":"2019-06-29 14:31:29","comment_date_int":1561789889,"comment_content":"  <a href=\"#tucao-5206353\" data-id=\"5206353\" class=\"tucao-link\">@乐色分类</a> 众所周知,蛋友的恋爱==暗恋","comment_parent":"4286850","comment_reply_ID":"0","is_jandan_user":0,"is_tip_user":0,"vote_positive":"3","vote_negative":"0"},{"comment_ID":"5206613","comment_post_ID":"102312","comment_author":"不懂就要说不说怎么知道对错","comment_date":"2019-06-29 14:59:58","comment_date_int":1561791598,"comment_content":" <a href=\"#tucao-5206353\" data-id=\"5206353\" class=\"tucao-link\">@乐色分类</a> 因为不同的人对谈恋爱的定义不一样。\n有的人觉得谈恋爱重点在于谈(婚论嫁),有的人觉得在于恋爱","comment_parent":"4286850","comment_reply_ID":"0","is_jandan_user":0,"is_tip_user":0,"vote_positive":"0","vote_negative":"0"}],"has_next_page":false}
{'code': 0, 'hot_tucao': [{'comment_ID': '5206422', 'comment_post_ID': '102312', 'comment_author': '不懂可以不说', 'comment_date': '2019-06-29 14:12:43', 'comment_date_int': 1561788763, 'comment_content': '  <a href="#tucao-5206353" data-id="5206353" class="tucao-link">@乐色分类</a> 很简单。我喜欢她 = 我恋爱了。', 'comment_parent': '4286850', 'comment_reply_ID': '0', 'is_jandan_user': 0, 'is_tip_user': 0, 'vote_positive': '11', 'vote_negative': '0'}], 'tucao': [{'comment_ID': '5206353', 'comment_post_ID': '102312', 'comment_author': '乐色分类', 'comment_date': '2019-06-29 14:01:29', 'comment_date_int': 1561788089, 'comment_content': '所以你是怎么定义谈恋爱的?看蛋友故事多了我不确定你们对谈恋爱的标准…', 'comment_parent': '4286850', 'comment_reply_ID': '0', 'is_jandan_user': 0, 'is_tip_user': 0, 'vote_positive': '4', 'vote_negative': '0'}, {'comment_ID': '5206422', 'comment_post_ID': '102312', 'comment_author': '不懂可以不说', 'comment_date': '2019-06-29 14:12:43', 'comment_date_int': 1561788763, 'comment_content': '  <a href="#tucao-5206353" data-id="5206353" class="tucao-link">@乐色分类</a> 很简单。我喜欢她 = 我恋爱了。', 'comment_parent': '4286850', 'comment_reply_ID': '0', 'is_jandan_user': 0, 'is_tip_user': 0, 'vote_positive': '12', 'vote_negative': '0'}, {'comment_ID': '5206522', 'comment_post_ID': '102312', 'comment_author': '慕行秋', 'comment_date': '2019-06-29 14:31:29', 'comment_date_int': 1561789889, 'comment_content': '  <a href="#tucao-5206353" data-id="5206353" class="tucao-link">@乐色分类</a> 众所周知,蛋友的恋爱==暗恋', 'comment_parent': '4286850', 'comment_reply_ID': '0', 'is_jandan_user': 0, 'is_tip_user': 0, 'vote_positive': '3', 'vote_negative': '0'}, {'comment_ID': '5206613', 'comment_post_ID': '102312', 'comment_author': '不懂就要说不说怎么知道对错', 'comment_date': '2019-06-29 14:59:58', 'comment_date_int': 1561791598, 'comment_content': ' <a href="#tucao-5206353" data-id="5206353" class="tucao-link">@乐色分类</a> 因为不同的人对谈恋爱的定义不一样。\n有的人觉得谈恋爱重点在于谈(婚论嫁),有的人觉得在于恋爱', 'comment_parent': '4286850', 'comment_reply_ID': '0', 'is_jandan_user': 0, 'is_tip_user': 0, 
'vote_positive': '0', 'vote_negative': '0'}], 'has_next_page': False}
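
For experimenting on small inputs, the neighbor check can be condensed into a self-contained sketch. The names `neighbor` and `escape_inner_quotes` are hypothetical, and the two payloads are made up for illustration. The first payload is repaired correctly; the second has an inner quote followed directly by a comma, so the check mistakes it for a structural quote and the result still fails to parse.

```python
import json

STRUCTURAL = set('{}[]:,')

def neighbor(text, index, step):
    """Return the first non-space character found from index, moving by step.

    Returns '' when the scan runs off either end of the text.
    """
    while 0 <= index < len(text):
        if text[index] != ' ':
            return text[index]
        index += step
    return ''

def escape_inner_quotes(text):
    """Hypothetical helper: escape quotes that have no structural neighbor."""
    out = []
    for i, ch in enumerate(text):
        if ch == '"':
            left = neighbor(text, i - 1, -1)
            right = neighbor(text, i + 1, 1)
            # A quote at the very start or end of the text is left alone.
            if left and right and left not in STRUCTURAL and right not in STRUCTURAL:
                out.append('\\')
        out.append(ch)
    return ''.join(out)

# Inner quotes surrounded by ordinary characters are escaped as intended.
ok = '{"comment_content":"<a href="#tucao-1">@user</a> hi"}'
parsed = json.loads(escape_inner_quotes(ok))
print(parsed["comment_content"])

# The quote after "yes" is followed by a comma, so it is left unescaped.
bad = '{"comment_content":"he said "yes", then left"}'
try:
    json.loads(escape_inner_quotes(bad))
    failed = False
except json.JSONDecodeError:
    failed = True
print(failed)
```

Building the output in a list avoids re-slicing the whole string for every inserted backslash, which is one likely source of the slowness on long responses.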

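It runs for now, but it cannot handle an inner quote that has one of {}[]:, as its nearest neighbor, and it also seems rather inefficient. I don't know whether there is a better way to solve this... sigh.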
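
One direction that may be worth checking, offered as an assumption rather than a diagnosis: the sample printed above already shows the inner quotes escaped as `\"` and the angle brackets as `\u003c` / `\u003e`. If the raw response body really looks like that, it is valid JSON and can go straight into `json.loads`. The post does not say how `a` was produced, so the sketch below only illustrates one way such a string can get broken, namely decoding the escapes by hand before parsing. The `raw` payload is a shortened, hypothetical stand-in.

```python
import json

# Hypothetical, shortened raw response body (ASCII only). The inner quotes are
# escaped as \" and the angle brackets as \u003c / \u003e.
raw = r'{"comment_content":"\u003ca href=\"#tucao-1\"\u003e@user\u003c/a\u003e hi"}'

# Parsing the raw text directly works: json.loads resolves both kinds of escape.
content = json.loads(raw)["comment_content"]
print(content)

# Assumption: an earlier step decoded the escapes by hand, for example like this.
# That turns \" into a bare ", which yields the kind of input that fails to parse.
decoded_first = raw.encode("ascii").decode("unicode_escape")
try:
    json.loads(decoded_first)
    broke = False
except json.JSONDecodeError:
    broke = True
print(broke)
```

If the string is already broken by the time it reaches the crawler, this does not help, and a heuristic repair like the one above remains the fallback.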
