
Removing Uncited References in a Tex File (with R)


Last week, with @3wen, we were working on the revised version of our work on smoothing densities of spatial processes (with edge correction). Usually, once a paper has been revised, some references have been added and others dropped, so you need to spend some time checking that all the references in the list are actually cited in the paper. For instance, consider the following compiled tex file:

Only three references are actually mentioned in the document, so we need to update the reference list (by removing the first three). If you use a BibTeX (.bib) file, this is very simple, since only cited references appear in the list. The problem here is that we used bibitems.

I wanted to do that manually this weekend, but @3wen suggested writing a simple R function that scans the tex file (as well as the aux file, actually) and removes uncited references. The idea is the following. First, let us scan the two files:

> library(stringr)
> setwd("/home/tex/")
> file_tex <- scan("file_test.tex", what = "character", sep = "\n")
Read 15 items
> file_aux <- scan("file_test.aux", what = "character", sep = "\n")
Read 21 items
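
One detail worth keeping in mind if the cleaned file is to be written back to disk: with sep = "\n", scan() skips blank lines by default, while readLines() keeps them. The following is a minimal sketch on a throwaway temporary file (the file content is made up for the illustration), using base R only:

```r
# Write a small three-line file whose second line is blank (made-up content).
tmp <- tempfile(fileext = ".tex")
writeLines(c("first line", "", "third line"), tmp)

# scan() drops the blank line (blank.lines.skip = TRUE is the default) ...
with_scan <- scan(tmp, what = "character", sep = "\n", quiet = TRUE)
# ... whereas readLines() returns every line, including the empty one.
with_readlines <- readLines(tmp)

print(length(with_scan))
print(length(with_readlines))
unlink(tmp)
```

For the purpose of locating bibitems, either reader works; the difference only matters when line positions or paragraph breaks in the original file have to be preserved.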

Then, we extract only parts related to the bibliography,

> beg_file <- which(str_detect(string = file_tex, pattern = "\\\\begin\\{thebibliography\\}"))
> end_file <- which(str_detect(string = file_tex, pattern = "\\\\end\\{thebibliography\\}"))

References here are the following lines

> biblio <- file_tex[seq(beg_file+1, end_file-1)]
> biblio
[1] "\\bibitem[Cressie(1991)]{Cressie} Cressie, N. (1991). Statistics for Spatial Data. New York: John Wiley \\& Sons"                                  
[2] "\\bibitem[Diggle (2002)]{Diggle} Diggle, P., Heagerty, P., Liang, K.Y. \\& Zeger, S. 2002. Analysis of Longitudinal Data. Oxford University Press."
[3] "\\bibitem[Ripley(1981)]{Ripley} Ripley, B. 1981. Spatial Statistics, Wiley, New York."                                             
[4] "\\bibitem[Scott(1992)]{Scott} Scott, D W 1992 Multivariate Density Estimation: Theory, Practice, and Visualization. New York, John Wiley and Sons."
[5] "\\bibitem[Silverman(2004)]{Silverman} Silverman B W 1986 Density Estimation for Statistics and Data Analysis."
[6] "London, Chapman \\& Hall."
[7] "\\bibitem[Wand \\& Jones(1995)]{Wand} Wand, M.P; Jones, M.C. (1995). Kernel Smoothing. London: Chapman \\& Hall/CRC. "

If you look carefully at the output, you can see that the fifth reference is spread over two lines, which happens frequently. So we need to determine precisely where each reference starts and where it ends.

> beg_bibitem <- which(str_detect(string = biblio, pattern = "\\\\bibitem"))
> go_through <- cbind(beg_bibitem, c(beg_bibitem[-1]-1,length(biblio)))
> go_through
     beg_bibitem  
[1,]           1 1
[2,]           2 2
[3,]           3 3
[4,]           4 4
[5,]           5 6
[6,]           7 7

Actually, we should also check whether a reference is already commented out: sometimes a bibliography contains entries preceded by a comment sign (%).

> go_through <- data.frame(beg = beg_bibitem, end = rep(NA, length(beg_bibitem)))
> for(i in seq_len(length(beg_bibitem)-1)){
+   go_through[i,2] <- beg_bibitem[i+1]-1
+ }
> go_through[nrow(go_through), 2] <- length(biblio)
> go_through$comment <- str_detect(biblio[beg_bibitem], "^%")
> go_through
  beg end comment
1   1   1   FALSE
2   2   2   FALSE
3   3   3   FALSE
4   4   4   FALSE
5   5   6   FALSE
6   7   7   FALSE
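
In the example above no entry is commented out, so the comment column is FALSE everywhere. As a hedged illustration of what the table looks like otherwise, here is a small sketch on a toy bibliography (the entries A, B and C are hypothetical), written with base R only and building the start/end table in one vectorized step instead of a loop:

```r
# Toy bibliography body (hypothetical entries): the second entry is already
# commented out and spans two lines.
biblio_toy <- c(
  "\\bibitem[A(2000)]{A} First entry.",
  "%\\bibitem[B(2001)]{B} Second entry, already commented out,",
  "%continued on a second line.",
  "\\bibitem{C} Third entry."
)

# Lines on which an entry starts.
beg <- grep("\\\\bibitem", biblio_toy)

spans <- data.frame(
  beg = beg,
  # Each entry ends just before the next one starts; the last one ends with the block.
  end = c(beg[-1] - 1, length(biblio_toy)),
  # An entry counts as commented out if its first line starts with %.
  comment = grepl("^%", biblio_toy[beg])
)
print(spans)
```

This sketch assumes that a commented entry has its % in the first column; an indented comment sign would need a pattern such as "^\\s*%".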

Let us now extract the labels of all the references.

> extract_ref_cite <- function(bibitem, file){
+   entree <- file[bibitem]
+   if(str_detect(entree, "bibitem\\[.*\\]\\{")){
+     nom_citation <- str_extract(entree, "]\\{(.*?)\\}")
+   }else{
+     nom_citation <- str_extract(entree, "\\{(.*?)\\}")
+   }
+   str_replace_all(string = nom_citation, pattern = "\\{|\\}|]", replacement = "")
+ }
> bibitems_ref <- unlist(lapply(beg_bibitem, extract_ref_cite, biblio))
> bibitems_ref
[1] "Cressie"   "Diggle"    "Ripley"    "Scott"     "Silverman" "Wand"
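
The two cases handled above (a bibitem with or without the optional [..] argument) can also be covered by a single pattern in which the optional argument is an optional group. The sketch below uses base R only; the helper name extract_label and the sample lines are assumptions made for the illustration, and it supposes that the optional argument contains no nested ] character:

```r
# Hypothetical helper: returns the {label} of a \bibitem line, or NA if the
# line is not a \bibitem line.
extract_label <- function(line) {
  # \bibitem, then an optional [..] argument, then the {label} to capture.
  pattern <- "\\\\bibitem(\\[[^]]*\\])?\\{([^}]*)\\}"
  m <- regmatches(line, regexec(pattern, line))
  # Each match holds: full match, optional argument, label.
  vapply(m, function(x) if (length(x) == 3) x[3] else NA_character_, character(1))
}

sample_lines <- c(
  "\\bibitem[Cressie(1991)]{Cressie} Cressie, N. (1991). Statistics for Spatial Data.",
  "\\bibitem{Ripley} Ripley, B. 1981. Spatial Statistics.",
  "London, Chapman \\& Hall."
)
labels <- extract_label(sample_lines)
print(labels)
```

Returning NA for lines that are not bibitems makes it easy to spot continuation lines that were passed in by mistake.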

We have six references, with those labels (as expected).

Now, let us look at the aux file to see which references are cited in the text:

> ind_cite <- which(str_detect(string = file_aux, pattern = "\\\\citation"))
> bibitems_cite_names <- unlist(lapply(ind_cite, extract_ref_cite, file_aux))
> bibitems_cite_names
[1] "Scott"     "Scott"     "Silverman" "Silverman" "Wand"      "Wand"     
[7] "Scott"     "Scott"

Note that references are mentioned twice (at least): once for the author’s name, once for the year of publication. Since we just need to see which one actually appears in the aux file, we can use

> bibitems_cite_names <- unique(bibitems_cite_names)
> bibitems_cite_names
[1] "Scott"     "Silverman" "Wand"
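
Depending on the citation package, a \cite command with several keys may be written to the aux file as a single \citation line with comma-separated keys, in which case the label extracted above would be the whole list rather than individual keys. The following sketch, in base R and on a made-up aux content, splits such lines before removing duplicates:

```r
# Made-up aux content: one \citation line carries two keys.
aux_toy <- c(
  "\\relax",
  "\\citation{Scott}",
  "\\citation{Silverman,Wand}",
  "\\citation{Scott}",
  "\\bibcite{Scott}{1}"
)

# Keep only the \citation lines and pull out what is between the braces.
cite_lines <- grep("^\\\\citation\\{", aux_toy, value = TRUE)
keys_raw <- sub("^\\\\citation\\{([^}]*)\\}.*$", "\\1", cite_lines)

# Split comma-separated keys, trim stray spaces, and drop duplicates.
cited <- unique(trimws(unlist(strsplit(keys_raw, ",", fixed = TRUE))))
print(cited)
```

When every \citation line holds a single key, as in the aux file used in this post, the split is harmless and the result is the same as with unique() alone.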

Now, we can see which references are cited,

> go_through$keep <- bibitems_ref %in% bibitems_cite_names
> go_through
  beg end comment  keep
1   1   1   FALSE FALSE
2   2   2   FALSE FALSE
3   3   3   FALSE FALSE
4   4   4   FALSE  TRUE
5   5   6   FALSE  TRUE
6   7   7   FALSE  TRUE

Based on that table, we can write a simple function: references that we do not need are commented out, while those that are cited are left as they are and will appear in the reference list.

> return_cite <- function(one_ligne){
+   citation <- str_c(biblio[one_ligne[1,"beg"]:one_ligne[1,"end"]], collapse = "\n")
+   if(!one_ligne[1,"keep"] && !str_detect(citation, "^%")){
+     citation <- str_c("%", str_replace_all(citation, pattern = "\n", replacement =  "\n%"))
+   }
+   citation
+ }

For instance,

> return_cite(go_through[1,])
[1] "%\\bibitem[Cressie(1991)]{Cressie} Cressie, N. (1991). Statistics for Spatial Data. New York: John Wiley \\& Sons"

since the first reference does not appear in the text, while

> return_cite(go_through[4,])
[1] "\\bibitem[Scott(1992)]{Scott} Scott, D W 1992 Multivariate Density Estimation: Theory, Practice, and Visualization. New York, John Wiley and Sons."

Now, we can easily generate our bibliography, in LaTeX:

> cat(unlist(lapply(1:nrow(go_through), function(x) return_cite(go_through[x,]))), sep = "\n\n")
%\bibitem[Cressie(1991)]{Cressie} Cressie, N. (1991). Statistics for Spatial Data. New York: John Wiley \& Sons

%\bibitem[Diggle (2002)]{Diggle} Diggle, P., Heagerty, P., Liang, K.Y. \& Zeger, S. 2002. Analysis of Longitudinal Data. Oxford University Press.

%\bibitem[Ripley(1981)]{Ripley} Ripley, B. 1981. Spatial Statistics, Wiley, New York.

\bibitem[Scott(1992)]{Scott} Scott, D W 1992 Multivariate Density Estimation: Theory, Practice, and Visualization. New York, John Wiley and Sons.

\bibitem[Silverman(2004)]{Silverman} Silverman B W 1986 Density Estimation for Statistics and Data Analysis. 
London, Chapman \& Hall.

\bibitem[Wand \& Jones(1995)]{Wand} Wand, M.P; Jones, M.C. (1995). Kernel Smoothing. London: Chapman \& Hall/CRC.

We simply need to copy that list and paste it in our LaTeX file. Nice, isn’t it?
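
To skip the copy and paste step, the same logic can be wrapped in one function that returns the modified lines of the tex file. This is only a sketch under a few assumptions: it uses base R, the function name comment_uncited and the toy document are hypothetical, the file contains exactly one thebibliography environment, and optional bibitem arguments contain no nested ] character.

```r
# Hypothetical helper: comment out every \bibitem entry whose label is not in
# `cited`, and return the full vector of tex lines.
comment_uncited <- function(tex, cited) {
  beg_env <- grep("\\\\begin\\{thebibliography\\}", tex)
  end_env <- grep("\\\\end\\{thebibliography\\}", tex)
  stopifnot(length(beg_env) == 1, length(end_env) == 1, end_env > beg_env + 1)

  idx <- seq(beg_env + 1, end_env - 1)
  body <- tex[idx]

  # Start and end line of each entry inside the environment.
  beg <- grep("\\\\bibitem", body)
  end <- c(beg[-1] - 1, length(body))
  labels <- sub("^.*\\\\bibitem(\\[[^]]*\\])?\\{([^}]*)\\}.*$", "\\2", body[beg])

  for (i in seq_along(beg)) {
    if (!(labels[i] %in% cited)) {
      rows <- beg[i]:end[i]
      # Leave lines that are already commented out untouched.
      rows <- rows[!grepl("^%", body[rows])]
      body[rows] <- paste0("%", body[rows])
    }
  }
  tex[idx] <- body
  tex
}

# Toy document (made up for the illustration): only B is cited.
tex_toy <- c(
  "\\begin{document}",
  "See \\cite{B}.",
  "\\begin{thebibliography}{9}",
  "\\bibitem[A(2000)]{A} First entry,",
  "continued on a second line.",
  "\\bibitem{B} Second entry.",
  "\\end{thebibliography}",
  "\\end{document}"
)
tex_clean <- comment_uncited(tex_toy, cited = "B")
cat(tex_clean, sep = "\n")
```

The result could then be saved with writeLines(), preferably to a new file so that the original tex file is kept as a backup.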


Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.
