Finding the most cited presenters

Sep 18, 2019 Web scraping, R

The 3rd International Conference on Econometrics and Statistics (EcoSta 2019) took place at National Chung Hsing University (NCHU) in Taichung, Taiwan, on 25-27 June 2019. The conference consisted of 10 parallel sessions, each made up of 14-17 sessions running at the same time, with 3-5 speakers per session. The full programme is available here.

Naturally, it was quite the optimization problem to pick which sessions to attend. For parallel sessions where multiple sessions appeared interesting and relevant for my research, my final choice became rather arbitrary.

After the conference, I decided to put my newly acquired web scraping skills to good use. I collected the names of the 593 presenting authors (and the co-authors of their presented papers) from the conference website, then looked up each author on Google Scholar to obtain a citation count for each of the 150 sessions.

The code

I started by scraping sessions, authors and titles using the methods from my previous post. Then I wrote a function that calculates the citation counts for a session (consisting of 3-5 presentations). Calling the function for each session and taking the column sums of the returned citation matrix gives the desired results.
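As a rough sketch of that last step, the per-session aggregation could look like the following. The function `fake_score` and the session names `S1` and `S2` are hypothetical stand-ins invented for illustration (no scraping takes place); the only assumption carried over from the post is that the citation function returns one row per talk and two columns.

```r
# Hypothetical stand-in for the citation function: returns a fixed N x 2 matrix
# (presenter citations, total citations) instead of scraping anything.
fake_score <- function(talklist) {
  N <- length(talklist)
  cbind(presenter = seq_len(N) * 10, total = seq_len(N) * 25)
}

# Hypothetical sessions: a named list of talk-URL vectors (placeholders, not real URLs)
sessions <- list(
  S1 = c("talk1", "talk2", "talk3"),
  S2 = c("talk4", "talk5", "talk6", "talk7")
)

# One row per session: the column sums of that session's citation matrix
session_totals <- t(sapply(sessions, function(talks) colSums(fake_score(talks))))
print(session_totals)
```

Swapping `fake_score` for the real function would give one row of presenter and total citation counts per session, ready to be sorted or plotted.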

Citation count function

library(rvest)  # read_html(), html_node(), html_text() and the %>% pipe

# Input: a vector of URLs (each talk has its own info web page)
# Returns an N x 2 matrix:
#   Column 1: citation count of the first-listed author (treated as the presenter)
#   Column 2: citation sum over all authors of the talk
scoreFunc = function(talklist)
{
  N = length(talklist)
  score = matrix(0, nrow=N, ncol=2)
  ScholarURL = "https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q="

  for(u in seq_len(N))
  {
    talk_html = read_html(talklist[u], encoding="UTF-8")
    talk = talk_html %>% html_node(".newline+ span") %>% html_text()

    # Authors are listed as "Name (Affiliation)", so split on ")"
    authors = strsplit(talk, split="\\)")[[1]]
    authors = gsub("\\[presenting\\]", "", authors)
    for(i in seq_along(authors))
    {
      # Keep the name before "(" and replace spaces with "+" for the query string
      authors[i] = gsub(" ", "+", strsplit(authors[i], split="\\(")[[1]][1])
      if(!is.na(authors[i]) && nchar(authors[i]) > 3)
      {
        author_html = read_html(paste0(ScholarURL, authors[i]))
        tmp = strsplit(toString(author_html %>% html_node("body") %>% html_text()), split="Cited by ")[[1]]
        # Only count authors whose search result shows a profile with a verified email
        if(grepl("Verified email", tmp[1], fixed=TRUE))
        {
          tmpscore = as.numeric(gsub("([0-9]+).*$", "\\1", tmp[2]))
          # Skip authors whose citation count could not be parsed
          if(!is.na(tmpscore))
          {
            if(i==1){score[u,1] = tmpscore}
            score[u,2] = score[u,2] + tmpscore
          }
        }
      }
    }
  }
  return(score)
}
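To make the string handling in the function easier to follow, here is a small offline sketch of its two parsing steps. Both input strings are made up for illustration: the author line assumes a "Name (Affiliation)" layout with a `[presenting]` marker, and the page text only imitates the pattern the function looks for ("Verified email" appearing before the first "Cited by"). The real pages may be laid out differently.

```r
# Hypothetical author line in the layout the function appears to assume
talk <- "Jane Doe (University A) [presenting]John Smith (University B)"

# Step 1: split on ")" so each piece holds one "Name (Affiliation" chunk
pieces <- strsplit(talk, split = "\\)")[[1]]
pieces <- gsub("[presenting]", "", pieces, fixed = TRUE)

# Keep only the text before "(" and build "+"-separated search terms
names_only <- trimws(sapply(strsplit(pieces, split = "\\("), `[`, 1))
query <- gsub(" ", "+", names_only)
print(query)

# Step 2: hypothetical page text imitating a search result with a profile hit
page_text <- "Jane Doe Verified email at example.org Cited by 1234 Some paper Cited by 56"
parts <- strsplit(page_text, split = "Cited by ")[[1]]

# A profile is assumed present when "Verified email" precedes the first "Cited by"
has_profile <- grepl("Verified email", parts[1], fixed = TRUE)

# The leading digits after the first "Cited by" are taken as the citation count
citations <- if (has_profile) as.numeric(gsub("([0-9]+).*$", "\\1", parts[2])) else 0
print(citations)
```

Unlike the function above, this sketch trims whitespace around the names, which avoids a leading "+" in the search term; that is a small convenience rather than a change in approach.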

Results

Parallel session B

Kjartan Kloster Osmundsen

Data analyst