Removing Uncited References in a Tex File (with R)


Last week, with @3wen, we were working on the revised version of our work on smoothing densities of spatial processes (with edge correction). Usually, once you have revised a paper, some references have been added and others dropped, so you need to spend some time checking that every reference in the list is actually cited in the paper. For instance, consider the following compiled tex file:

Only three references are actually cited in the document, so we need to update the reference list (by removing the first three). If you use a bib file, this is very simple, since only cited references appear in the list. The problem here is that we wrote the bibliography by hand, with bibitems.

I wanted to work on that manually this weekend, but @3wen suggested writing a simple R function that scans the tex file (as well as the aux file, actually) to remove uncited references. The idea is the following. First, let us scan the two files

> library(stringr)
> setwd("/home/tex/")
> file_tex <- scan("file_test.tex", what = "character", sep = "\n")
Read 15 items
> file_aux <- scan("file_test.aux", what = "character", sep = "\n")
Read 21 items

Then, we extract only parts related to the bibliography,

> beg_file <- which(str_detect(string = file_tex, pattern = "\\\\begin\\{thebibliography\\}"))
> end_file <- which(str_detect(string = file_tex, pattern = "\\\\end\\{thebibliography\\}"))

References here are the following lines

> biblio <- file_tex[seq(beg_file+1, end_file-1)]
> biblio
[1] "\\bibitem[Cressie(1991)]{Cressie} Cressie, N. (1991). Statistics for Spatial Data. New York: John Wiley \\& Sons"                                  
[2] "\\bibitem[Diggle (2002)]{Diggle} Diggle, P., Heagerty, P., Liang, K.Y. \\& Zeger, S. 2002. Analysis of Longitudinal Data. Oxford University Press."
[3] "\\bibitem[Ripley(1981)]{Ripley} Ripley, B. 1981. Spatial Statistics, Wiley, New York."                                             
[4] "\\bibitem[Scott(1992)]{Scott} Scott, D W 1992 Multivariate Density Estimation: Theory, Practice, and Visualization. New York, John Wiley and Sons."
[5] "\\bibitem[Silverman(2004)]{Silverman} Silverman B W 1986 Density Estimation for Statistics and Data Analysis."
[6] "London, Chapman \\& Hall."
[7] "\\bibitem[Wand \\& Jones(1995)]{Wand} Wand, M.P; Jones, M.C. (1995). Kernel Smoothing. London: Chapman \\& Hall/CRC. "

If you look carefully at the output, you can see that the fifth reference is spread over two lines, which might happen frequently. So we need to check precisely where each reference starts and where it ends.

> beg_bibitem <- which(str_detect(string = biblio, pattern = "\\\\bibitem"))
> go_through <- cbind(beg_bibitem, c(beg_bibitem[-1]-1,length(biblio)))
> go_through
     beg_bibitem  
[1,]           1 1
[2,]           2 2
[3,]           3 3
[4,]           4 4
[5,]           5 6
[6,]           7 7

Actually, we should also check whether a reference has already been commented out: sometimes a bibitem line starts with a comment sign (%).

> go_through <- data.frame(beg = beg_bibitem, end = rep(NA, length(beg_bibitem)))
> for(i in seq_len(length(beg_bibitem)-1)){
+   go_through[i,2] <- beg_bibitem[i+1]-1
+ }
> go_through[nrow(go_through), 2] <- length(biblio)
> go_through$comment <- str_detect(biblio[beg_bibitem], "^%")
> go_through
  beg end comment
1   1   1   FALSE
2   2   2   FALSE
3   3   3   FALSE
4   4   4   FALSE
5   5   6   FALSE
6   7   7   FALSE

Let us now extract the labels (the citation keys) of all the references.

> extract_ref_cite <- function(bibitem, file){
+   entree <- file[bibitem]
+   if(str_detect(entree, "bibitem\\[.*\\]\\{")){
+     nom_citation <- str_extract(entree, "]\\{(.*?)\\}")
+   }else{
+     nom_citation <- str_extract(entree, "\\{(.*?)\\}")
+   }
+   str_replace_all(string = nom_citation, pattern = "\\{|\\}|]", replacement = "")
+ }
> bibitems_ref <- unlist(lapply(beg_bibitem, extract_ref_cite, biblio))
> bibitems_ref
[1] "Cressie"   "Diggle"    "Ripley"    "Scott"     "Silverman" "Wand"
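
The extraction step above can also be sketched in base R, without stringr. This is only an illustration under stated assumptions: `extract_key` and `demo_lines` are hypothetical names, not part of the original code, and the pattern assumes that the optional `[...]` label contains no `]` character.

```r
# Hypothetical helper: return the citation key of each \bibitem line,
# or NA for lines that do not start a reference.
extract_key <- function(lines) {
  # Drop the optional [label] argument, if any
  no_label <- sub("\\\\bibitem\\[[^]]*\\]", "\\\\bibitem", lines)
  # Capture what sits inside the first {...} after \bibitem
  m <- regmatches(no_label, regexec("\\\\bibitem\\{([^}]*)\\}", no_label))
  vapply(m, function(x) if (length(x) == 2L) x[2] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

# Toy input, loosely based on the entries shown above
demo_lines <- c(
  "\\bibitem[Scott(1992)]{Scott} Scott, D W 1992.",
  "\\bibitem{Ripley} Ripley, B. 1981.",
  "London, Chapman \\& Hall.",
  "\\bibitem[Wand \\& Jones(1995)]{Wand} Wand, M.P; Jones, M.C. (1995)."
)
keys <- extract_key(demo_lines)
print(keys)
```

Because the helper is vectorized and returns NA on continuation lines, it could be applied to the whole bibliography block at once, keeping only the non-missing values.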

We have six references, with those labels (as expected).

Now, if we look at the aux file, to see which references are cited in the text,

> ind_cite <- which(str_detect(string = file_aux, pattern = "\\\\citation"))
> bibitems_cite_names <- unlist(lapply(ind_cite, extract_ref_cite, file_aux))
> bibitems_cite_names
[1] "Scott"     "Scott"     "Silverman" "Silverman" "Wand"      "Wand"     
[7] "Scott"     "Scott"

Note that references are mentioned twice (at least): once for the author’s name, once for the year of publication. Since we just need to see which one actually appears in the aux file, we can use

> bibitems_cite_names <- unique(bibitems_cite_names)
> bibitems_cite_names
[1] "Scott"     "Silverman" "Wand"
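
One case the example file does not exercise: when several keys are cited in a single command, LaTeX writes them to the aux file as one comma-separated `\citation{key1,key2}` line. A minimal sketch of how such lines could be split, in base R and with made-up aux lines (`aux_demo` and `cited_keys` are hypothetical names):

```r
# Made-up aux content for illustration only
aux_demo <- c(
  "\\relax",
  "\\citation{Scott}",
  "\\citation{Silverman,Wand}",
  "\\bibcite{Scott}{4}",
  "\\citation{Scott}"
)

# Keep only the \citation{...} lines and strip the wrapper
cite_lines <- grep("^\\\\citation\\{", aux_demo, value = TRUE)
inside <- sub("^\\\\citation\\{([^}]*)\\}.*$", "\\1", cite_lines)

# Split on commas, trim stray spaces, and drop duplicates
cited_keys <- unique(trimws(unlist(strsplit(inside, ",", fixed = TRUE))))
print(cited_keys)
```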

Now, we can see which references are cited,

> go_through$keep <- bibitems_ref %in% bibitems_cite_names
> go_through
  beg end comment  keep
1   1   1   FALSE FALSE
2   2   2   FALSE FALSE
3   3   3   FALSE FALSE
4   4   4   FALSE  TRUE
5   5   6   FALSE  TRUE
6   7   7   FALSE  TRUE
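
The same comparison can be run in both directions with set operations. The sketch below reuses the keys from the example; the variable names are hypothetical. The second check is an addition to the original workflow: it lists keys that are cited in the text but have no bibitem.

```r
# Keys found in the bibliography and keys found in the aux file (from the example)
bib_keys   <- c("Cressie", "Diggle", "Ripley", "Scott", "Silverman", "Wand")
cited_keys <- c("Scott", "Silverman", "Wand")

# References that are never cited (candidates for removal)
uncited <- setdiff(bib_keys, cited_keys)

# Keys cited in the text with no matching entry (undefined citations)
undefined <- setdiff(cited_keys, bib_keys)

print(uncited)
print(undefined)
```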

Based on that table, we can write a simple function: references that we do not need are turned into comments, while those that are cited are left as they are and will appear in the reference list.

> return_cite <- function(one_ligne){
+   citation <- str_c(biblio[one_ligne[1,"beg"]:one_ligne[1,"end"]], collapse = "\n")
+   if(!one_ligne[1,"keep"] && !str_detect(citation, "^%")){
+     # comment out every line of the entry, including the first one
+     citation <- str_c("%", str_replace_all(citation, pattern = "\n", replacement = "\n%"))
+   }
+   citation
+ }

For instance,

> return_cite(go_through[1,])
[1] "%\\bibitem[Cressie(1991)]{Cressie} Cressie, N. (1991). Statistics for Spatial Data. New York: John Wiley \\& Sons"

since the first reference does not appear in the text, while

> return_cite(go_through[4,])
[1] "\\bibitem[Scott(1992)]{Scott} Scott, D W 1992 Multivariate Density Estimation: Theory, Practice, and Visualization. New York, John Wiley and Sons."

Now, we can easily generate our bibliography, in LaTeX:

> cat(unlist(lapply(1:nrow(go_through), function(x) return_cite(go_through[x,]))), sep = "\n\n")
%\bibitem[Cressie(1991)]{Cressie} Cressie, N. (1991). Statistics for Spatial Data. New York: John Wiley \& Sons

%\bibitem[Diggle (2002)]{Diggle} Diggle, P., Heagerty, P., Liang, K.Y. \& Zeger, S. 2002. Analysis of Longitudinal Data. Oxford University Press.

%\bibitem[Ripley(1981)]{Ripley} Ripley, B. 1981. Spatial Statistics, Wiley, New York.

\bibitem[Scott(1992)]{Scott} Scott, D W 1992 Multivariate Density Estimation: Theory, Practice, and Visualization. New York, John Wiley and Sons.

\bibitem[Silverman(2004)]{Silverman} Silverman B W 1986 Density Estimation for Statistics and Data Analysis. 
London, Chapman \& Hall.

\bibitem[Wand \& Jones(1995)]{Wand} Wand, M.P; Jones, M.C. (1995). Kernel Smoothing. London: Chapman \& Hall/CRC.

We simply need to copy that list and paste it in our LaTeX file. Nice, isn’t it?
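
Instead of copying and pasting, the result could also be written straight to a file. The following is a rough sketch in base R on a toy bibliography: `biblio_demo`, `spans` and `out_file` are hypothetical stand-ins for `biblio`, `go_through` and the target file, and the output goes to a temporary file rather than to the paper itself.

```r
# Toy stand-ins for `biblio` and `go_through`; entry B spans two lines and is uncited
biblio_demo <- c(
  "\\bibitem{A} First entry.",
  "\\bibitem{B} Second entry,",
  "continued on a second line.",
  "\\bibitem{C} Third entry."
)
spans <- data.frame(beg = c(1, 2, 4), end = c(1, 3, 4), keep = c(TRUE, FALSE, TRUE))

# Prefix every line of an uncited entry with "%", unless it is already commented
out <- unlist(lapply(seq_len(nrow(spans)), function(i) {
  entry <- biblio_demo[spans$beg[i]:spans$end[i]]
  if (!spans$keep[i]) {
    entry <- ifelse(startsWith(entry, "%"), entry, paste0("%", entry))
  }
  entry
}))

# Write the cleaned bibliography to a temporary file
out_file <- tempfile(fileext = ".tex")
writeLines(out, out_file)
print(out)
```

Writing to a separate file first leaves the original tex file untouched until the result has been checked.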


Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.
