Statistics on Machine Learning Papers · Xiaoting Tang's Site

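This post is a statistical look at the machine learning papers indexed on IEEE Xplore.
I am new to machine learning, and at first I had no idea which papers to read. So I thought: why not use Python to count which papers matter most?
That is how this post came about.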

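The first step is getting the data. This time I chose IEEE Xplore as the data source. This is what its website looks like:

Let's run a simple search and see how the page changes:

The search results turn out to be structured, which is great! Now let's look at the HTML structure:

Each search result is wrapped in a div whose class is List-results-items. Looking inside that div, we find the fields we care about: title, year, authors, citation count, journal, and abstract.

My plan is to write a Python crawler that first downloads the page source, then parses it, and finally pulls out the data we want through the relevant HTML tags.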

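I first tried fetching the page with urllib2, but the source I got back did not contain what I was looking for. That told me this is a dynamic page, which has to be fetched with a tool that can handle dynamic pages. What is a dynamic page? Simply put, it is a page whose content is partly (or entirely) generated by JavaScript. That also explains why urllib2 failed: a urllib2 request cannot execute JavaScript, so the paper data we want never shows up in the response.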
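To make the difference concrete, here is a minimal sketch using only the standard library. The two HTML strings are made up for illustration (they are not real IEEE Xplore responses): one stands for what a plain HTTP client receives, the other for the DOM after a browser has run the scripts. Only the class name List-results-items comes from the page itself.

```python
from html.parser import HTMLParser


class ResultCounter(HTMLParser):
    """Counts <div> elements whose class list contains List-results-items."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "List-results-items" in classes:
            self.count += 1


def count_results(html):
    parser = ResultCounter()
    parser.feed(html)
    return parser.count


# Hypothetical source as a plain HTTP client sees it: an empty shell plus a script tag.
static_source = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

# Hypothetical DOM after a browser has executed the script.
rendered_source = (
    '<html><body><div id="app">'
    '<div class="List-results-items"><h2><a>Paper A</a></h2></div>'
    '<div class="List-results-items"><h2><a>Paper B</a></h2></div>'
    '</div></body></html>'
)

print(count_results(static_source))    # 0
print(count_results(rendered_source))  # 2
```

The parser is the same in both cases; what differs is whether JavaScript has already filled in the results.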

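As it happens, a package called selenium can make requests to dynamic pages. In short, Selenium is a tool built for automated website testing, and it can mimic a browser's behavior in almost every respect. It also has Python bindings, which is exactly what I need.

Installing selenium is easy. With pip: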

pip install selenium

With conda:

conda install -c metaperl selenium=2.40.0

Once it is installed, open your favorite Python editor.
Here is a short piece of selenium code that opens a Firefox browser instance:

from selenium import webdriver
driver = webdriver.Firefox()

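A new Firefox window should pop up.

Next, I settled on four topics: 'Machine Learning', 'Deep Learning', 'Data Mining', and 'Neural Network'. I am going to scrape the information on every paper in IEEE Xplore related to these four topics.

Here is my setup: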

topics = {}
ml = {
    'keyword' : 'machine%20learning',
    'page_count' : 508,
    'total': 50800,
}
dl = {
    'keyword' : 'deep%20learning',
    'page_count' : 29,
    'total' : 2900,
}
nn = {
    'keyword' : 'neural%20network',
    'page_count' : 1221,
    'total' : 122100,
}
dm = {
    'keyword' : 'data%20mining',
    'page_count' : 847,
    'total' : 84700
}
topics['ml'] = ml
topics['dl'] = dl
topics['nn'] = nn
topics['dm'] = dm

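The numbers above, such as 'page_count': 508, are the largest page numbers IEEE Xplore offers for each topic when every page shows 100 papers. Each 'total' is simply page_count × 100.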
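As a rough sketch of how these values fit together: the page count is the result total divided by 100 and rounded up, and the keyword is the URL-encoded query. The helper names page_count and search_url are made up for this example. The URL pattern is the one used by the crawler in this post.

```python
import math
from urllib.parse import quote

BASE_URL = 'http://ieeexplore.ieee.org/search/searchresult.jsp'
ROWS_PER_PAGE = 100


def page_count(total_results, rows_per_page=ROWS_PER_PAGE):
    """Number of result pages needed to show total_results records."""
    return math.ceil(total_results / rows_per_page)


def search_url(query, page, rows_per_page=ROWS_PER_PAGE):
    """Build the search URL for one result page; quote() turns spaces into %20."""
    return (BASE_URL + '?newsearch=true&queryText=' + quote(query)
            + '&rowsPerPage=' + str(rows_per_page)
            + '&pageNumber=' + str(page))


print(page_count(50800))  # 508
print(search_url('machine learning', 1))
```

Building the keyword with quote() avoids typing the %20 escapes by hand.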

With the setup done, we can start fetching the page source.

# Fetch the page source, topic by topic
import time
from bs4 import BeautifulSoup as bs
from progressbar import ProgressBar

soups = []
for topic in topics:
    print('Topic: %s' % topics[topic]['keyword'])
    length = topics[topic]['page_count']
    pbar = ProgressBar(length)  # show progress
    for i in xrange(1, length + 1):  # pages are numbered from 1, so include the last page
        driver.get('http://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText='+ topics[topic]['keyword'] +'&rowsPerPage=100&pageNumber='+str(i))
        time.sleep(5)
        for j in xrange(5):  # Important: the page only loads more results after scrolling down, so scroll 5 times in a row and allow 2 seconds of loading time after each scroll
            driver.execute_script("window.scrollTo(0, "+ str(100000) +");")
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)
        source = driver.page_source
        soup = bs(source)
        soups.append(soup)
        pbar.increment()
    pbar.finish()
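The fixed five scrolls work, but they spend the same time on every page. One possible alternative, sketched below, is to keep scrolling until the page height stops growing. This is not what the crawler above does. The function name scroll_until_stable and its parameters are made up for illustration; it only relies on the driver's execute_script method.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=10):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scroll rounds that were performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more results
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # nothing new was loaded, so stop early
        last_height = new_height
    return max_rounds
```

It would be called as scroll_until_stable(driver) right after driver.get(...), in place of the inner loop.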

As it turns out, scraping dynamic pages is very time-consuming. The four topics took me:

Topic	Pages	Time
Machine Learning	508	7.29 hr
Deep Learning	29	0.58 hr
Data Mining	847	10.72 hr
Neural Network	1221	9.74 hr

That adds up to roughly 28 hours, or about two days in practice.
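A quick back-of-the-envelope check of the table, using only the numbers listed above. The 15 seconds of built-in waiting per page comes from the crawler's own sleep calls (5 seconds after loading, then 5 scrolls of 2 seconds each).

```python
# (pages, hours) per topic, copied from the table above
crawl = {
    'Machine Learning': (508, 7.29),
    'Deep Learning': (29, 0.58),
    'Data Mining': (847, 10.72),
    'Neural Network': (1221, 9.74),
}

# Seconds the crawler sleeps on every page: 5 after loading + 5 scrolls * 2
MIN_SLEEP_PER_PAGE = 5 + 5 * 2

total_hours = sum(hours for _, hours in crawl.values())
print('total: %.2f hr' % total_hours)  # total: 28.33 hr

for topic, (pages, hours) in crawl.items():
    seconds_per_page = hours * 3600 / pages
    print('%s: %.1f s/page (at least %d s of it is sleeping)'
          % (topic, seconds_per_page, MIN_SLEEP_PER_PAGE))
```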

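Still, the amount of data I ended up with is very encouraging: the page sources for all the topics add up to about 1.33 GB.

Next comes the analysis. This is where the patterns I found in the page source come in handy: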

# Parsing
import re
import unicodecsv as csv
years = []
titles = []
authors = []
publishers = []
abstracts = []
citeds = []
def innerHTML(element):  # return the contents of a node
    return element.decode_contents(formatter="html")
base = 0
for topic in topics:
    for soup in soups[base:base+topics[topic]['page_count']]:
        results = soup.select('div.List-results-items')
        for r in results:
            title = r.select('h2 a.ng-binding')
            if not title:
                title = r.select('h2 span')
            year = r.select('span[ng-if="::record.publicationYear"]')
            author = r.select('span[ng-bind-html="::author.preferredName"]')
            publisher = r.select('a[ng-bind-html="::record.publicationTitle"]')
            abstract = r.select('span[ng-bind-html="::record.abstract"]')
            cited = r.select('span[ng-if="::record.citationCount"]')
            title, n = re.subn("\\<.*?\\>", '', innerHTML(title[0]))
            year, n = re.subn("\\<.*?\\>", '', innerHTML(year[0]))
            year, n = re.subn(r'\s+', '', year) # strip whitespace
            year, n = re.subn('Year:', '', year) # strip the 'Year:' label
            if not author:
                author = u'No author'  # a record may have no author
            else:
                author, n = re.subn("\\<.*?\\>", '', innerHTML(author[0]))  # strip the HTML tags from the content
            publisher, n = re.subn("\\<.*?\\>", '', innerHTML(publisher[0]))
            if not abstract:
                abstract = u'No Abstract'
            else:                
                abstract, n = re.subn("\\<.*?\\>", '', innerHTML(abstract[0]))
            if not cited:
                cited = u'0'
            else:
                cited, n = re.subn("<.*?>", '', innerHTML(cited[0]))
                cited = cited.replace('Papers(', '') # strip the 'Papers(' prefix
                cited = cited.replace(')', '') # strip the closing ')'
            # store the row in a CSV file
            with open(topic+'.csv', 'a') as f:
                writer = csv.writer(f, encoding='utf8', delimiter=',')
                row = [title, author, year, cited, publisher, abstract]
                writer.writerow(row)                
    base += topics[topic]['page_count']

This gives us four tidy CSV files with the irrelevant information stripped out.
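The cleaning steps can be tried in isolation on small fragments. The sketch below uses made-up HTML snippets, not real IEEE Xplore markup, and the helper names are hypothetical. It applies the same three ideas as the parser: strip tags, drop whitespace and the 'Year:' label, and peel the number out of 'Papers(...)'.

```python
import re

TAG = re.compile(r'<.*?>')  # non-greedy: matches one tag at a time


def strip_tags(fragment):
    """Remove anything that looks like an HTML tag."""
    return TAG.sub('', fragment)


def clean_year(fragment):
    text = strip_tags(fragment)
    text = re.sub(r'\s+', '', text)  # \s+ removes runs of whitespace
    return text.replace('Year:', '')


def clean_cited(fragment):
    text = strip_tags(fragment)
    return text.replace('Papers(', '').replace(')', '').strip()


# Hypothetical fragments, for illustration only
print(clean_year('<span> Year: <b>2015</b> </span>'))  # 2015
print(clean_cited('<a href="#">Papers(12)</a>'))       # 12
```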

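They look like this:

It is worth noting that more than 1.33 GB of page source shrinks to CSV files of only about 20 MB each after processing.
I can't help remarking that big data may be big, but the part that is actually useful can be quite small.

The next step is to dig out the information hidden in this data.

I have six goals:

List the most-cited papers
List the most-cited authors
List the authors who wrote the most papers
List the most-cited publishers
Word frequency analysis of titles
Word frequency analysis of abstracts

To do the analysis quickly, I will use the pandas package for the computation and matplotlib for the final plots. Here is my setup: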

#-*- coding:utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# show figures inline in the Jupyter notebook
%matplotlib notebook
mldf = pd.read_csv('csv/ml.csv', names=['Title', 'Author', 'Year', 'CitedCount', 'Publisher', 'Abstract'])
dmdf = pd.read_csv('csv/dm.csv', names=['Title', 'Author', 'Year', 'CitedCount', 'Publisher', 'Abstract'])
dldf = pd.read_csv('csv/dl.csv', names=['Title', 'Author', 'Year', 'CitedCount', 'Publisher', 'Abstract'])
nndf = pd.read_csv('csv/nn.csv' , names=['Title', 'Author', 'Year', 'CitedCount', 'Publisher', 'Abstract'])
def get_top(group):  # sort the DataFrame by CitedCount in descending order
    return group.sort_values(by='CitedCount', ascending=False)
ml_top = get_top(mldf)
dm_top = get_top(dmdf)
dl_top = get_top(dldf)
nn_top = get_top(nndf)
dfs = {  # a dict of DataFrames, to make looping over them easier later
    'ml': ml_top,
    'dm': dm_top,
    'dl': dl_top,
    'nn': nn_top,
}

Plotting the most-cited papers

def drawMostCitedPaper(CountList, TitleList, c='#E65235', name='samplefigure'):
    fig, ax = plt.subplots(figsize=(10, 40))
    l = len(CountList)
    ax.barh(range(l), CountList, color=c, align='center')
    ax.set_title('Top ' + str(l) + ' Cited Papers', fontsize=14)
    ax.set_yticks(range(l))
    ax.set_ylim([0, l])
    ax.set_yticklabels(TitleList, fontsize=14)
    ax.set_xlabel('Citation Count')
#     plt.show()
    fig.savefig('img/'+name, bbox_inches='tight')
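# Draw the charts. The *_top100 names are not defined anywhere in this post;
# they are assumed here to be the 100 most-cited rows of each sorted DataFrame.
ml_top100, dl_top100, dm_top100, nn_top100 = [df.head(100) for df in (ml_top, dl_top, dm_top, nn_top)]
# Reverse both lists so the most-cited paper ends up at the top of the horizontal bar chart
drawMostCitedPaper(ml_top100.CitedCount.tolist()[::-1], ml_top100.Title.tolist()[::-1], name='ml')
drawMostCitedPaper(dl_top100.CitedCount.tolist()[::-1], dl_top100.Title.tolist()[::-1], c='#00D680', name='dl')
drawMostCitedPaper(dm_top100.CitedCount.tolist()[::-1], dm_top100.Title.tolist()[::-1], c='#D866D6', name='dm')
drawMostCitedPaper(nn_top100.CitedCount.tolist()[::-1], nn_top100.Title.tolist()[::-1], c='#EBC12B', name='nn')

The charts: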

Machine Learning Top 100 Cited Papers
Data Mining Top 100 Cited Papers
Deep Learning Top 100 Cited Papers
Neural Network Top 100 Cited Papers

Plotting the most-cited authors

def drawTop100CitedPerson(topic='ml', c='#E65235', name='samplefigure'):
    df = dfs[topic]
    keywords = {
        'ml' : 'Machine Learning',
        'nn' : 'Neural Network',
        'dm' : 'Data Mining',
        'dl' : 'Deep Learning',
    }
    authordf = df.groupby(['Author']).CitedCount.sum()   # Important: sum CitedCount within each Author group
    authordf = authordf.sort_values(ascending=True)
    CountList = list(reversed(authordf[:-101:-1].tolist()))
    NameList = list(reversed(authordf.index.values.tolist()[:-101:-1]))
    NameList = [unicode(x, errors='replace') for x in NameList]
    fig, ax = plt.subplots(figsize=(10, 40))
    l = len(CountList)
    ax.barh(range(l), CountList, color=c, align='center')
    ax.set_title(keywords[topic] +' Top ' + str(l) + ' Cited Person', fontsize=14)
    ax.set_yticks(range(l))
    ax.set_ylim([0, l])
    ax.set_yticklabels(NameList, fontsize=14)
    ax.set_xlabel('Citation Count')
    plt.show()
    fig.savefig('img/'+name, bbox_inches='tight')
# Draw the charts
drawTop100CitedPerson(topic='ml', c='#49C12B', name='ml_person')
drawTop100CitedPerson(topic='nn', c='#EBC12B', name='nn_person')
drawTop100CitedPerson(topic='dm', c='#D866D6', name='dm_person')
drawTop100CitedPerson(topic='dl', c='#00D680', name='dl_person')
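The core of the function is the group-and-sum step. Here is a minimal sketch of it on a tiny made-up table; the column names follow the ones used in this post, and the author names and counts are invented. As far as the parsing code shows, only the first matched author of each paper is stored, so the totals are effectively per first author.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from this post.
df = pd.DataFrame({
    'Author': ['A. Author', 'B. Author', 'A. Author', 'C. Author'],
    'CitedCount': [10, 3, 5, 7],
})

# Sum the citations of every paper that shares the same Author value,
# then keep the largest totals in descending order.
totals = df.groupby('Author')['CitedCount'].sum()
top = totals.sort_values(ascending=False).head(2)

print(top.index.tolist())  # ['A. Author', 'C. Author']
print(top.tolist())        # [15, 7]
```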

The results:

Machine Learning Top 100 Cited Author
Data Mining Top 100 Cited Author
Deep Learning Top 100 Cited Author
Neural Network Top 100 Cited Author

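Listing the most prolific authors

The code is much the same as for the most-cited authors, so I won't list it here. If you are interested, you can find the link at the end of this post.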
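For readers who just want the idea, here is a rough, dependency-free sketch of the counting step. It is not the author's code, and the author names are made up. The only change from the most-cited case is that papers are counted per author instead of citations being summed; in pandas, df.groupby('Author').size() plays the same role.

```python
from collections import Counter

# Made-up Author column values, one entry per paper
authors = ['A. Author', 'B. Author', 'A. Author',
           'C. Author', 'A. Author', 'B. Author']

paper_counts = Counter(authors)  # number of papers per author

# The two most prolific authors in this toy list
for name, n in paper_counts.most_common(2):
    print(name, n)
```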

Here are the results:

Machine Learning Top 100 Prolific Author
Data Mining Top 100 Prolific Author
Deep Learning Top 100 Prolific Author
Neural Network Top 100 Prolific Author

Interestingly, in the Machine Learning field, 71 of the 100 most prolific authors are of Chinese descent, but only 27 of the 100 most-cited authors are.

Listing the most-cited publishers

Machine Learning Top 100 Prolific Author
Data Mining Top 100 Prolific Author
Deep Learning Top 100 Prolific Author
Neural Network Top 100 Prolific Author

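Word frequency analysis of titles

I want to know which words appear most often in the titles of these more than two hundred thousand papers. Word frequency is easy to compute: take the DataFrame from before, change the filter condition, sort, and that's it.
To show the frequencies in an intuitive way, I chose to present them as word clouds.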
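Before bringing in NLTK, the basic counting idea can be sketched with the standard library alone. The titles and the tiny stop-word list below are made up for illustration, and the function name title_word_counts is hypothetical. There is no lemmatization here, so 'network' and 'networks' would be counted separately.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list, not a complete one
STOPWORDS = {'a', 'an', 'the', 'of', 'for', 'in', 'on', 'and', 'with', 'to'}


def title_word_counts(titles):
    """Count lower-cased words across all titles, skipping stop words."""
    counts = Counter()
    for title in titles:
        words = re.findall(r'[a-z]+', title.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


titles = [
    'Deep Learning for Image Classification',
    'A Survey of Deep Neural Networks',
    'Learning to Rank with Neural Networks',
]
counts = title_word_counts(titles)
print(counts['deep'], counts['learning'], counts['rank'])  # 2 2 1
```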

Python has a dedicated word cloud package, WordCloud, which is simple and easy to use. For more on using WordCloud, see my previous blog post.

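# Tokenize and lemmatize the words, then accumulate their frequencies
from collections import Counter
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer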
def getWordCount(df):
    dflist = df.Abstract.tolist()
#     tagged_corpus = [pos_tag(word_tokenize(document)) for document in dflist]
    tagged_corpus = []
    for document in dflist:
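        try:
            tmp = word_tokenize(document)
        except Exception:
            continue  # skip entries that cannot be tokenized (for example, missing values)
        tagged_corpus.append(pos_tag(tmp))
    def lemmatize(token, tag):
        # pos_tag returns Penn Treebank tags such as 'NN' or 'VBD'; WordNet expects 'n' or 'v'
        if tag[0].lower() in ['n', 'v']:
            return lemmatizer.lemmatize(token, tag[0].lower())
        return token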
    lemmatizer = WordNetLemmatizer()
    lemmatized_corpus = [[lemmatize(token, tag) for token, tag in document] for document in tagged_corpus]
    tmp = []
    for title in lemmatized_corpus:
        for word in title:
            tmp.append(word.lower())
    lemmatized_corpus = tmp
    mystopwords = [
        ',',
        ':',
        ';',
        '.',
        '(',
        ')',
        '[',
        ']',
        "'s",
        '-',
        '?',
        "'",
        '%',
        '&',
        '...',
        '…',
        '.'
    ]
    english_stopwords = set(stopwords.words('english'))  # load the stop-word list once
    count = Counter(tmp).items()
    count = sorted(count, key=lambda x:x[1], reverse=True)
    count = [(w, c) for w, c in count if w.lower() not in english_stopwords and w not in mystopwords]
    return count

# Define the word cloud function
def drawWordCloud(frequency, name='samplecloud'):
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud
    %matplotlib notebook
    fig, ax = plt.subplots(figsize=(40, 40), dpi=60)
    wc = WordCloud(background_color="white",
    max_font_size=200, # maximum font size
    random_state=2,
    relative_scaling=1,)
#     prefer_horizontal=True )
#     ranks_only=True)
    wc.fit_words(frequency)  # note: recent wordcloud versions expect a dict here, e.g. dict(frequency)
    plt.imshow(wc)
    plt.axis('off')
    plt.savefig('img/'+name)  # save before show() so the saved image is not blank
    plt.show()

# Draw the word clouds
for topic in dfs:
    df = dfs[topic]
    count = getWordCount(df)
    drawWordCloud(count, name=topic + '_wordcloud')

Here are the results:

Machine Learning:

Data Mining:

Deep Learning:

Neural Networks:

Source code: crawler and processing

Please do not repost.