Mendeley – Fixing Author and Journal Metadata

I use Mendeley to store, organize and manage my library of academic papers. It’s tagging and search features are excellent, and Mendeley helps keep the process of finding previously read literature manageable. At times I hear or see an author’s name and wonder what papers of theirs I have previously read. In Mendeley, you can select and view all of the papers by a single author quite easily. However, there is an issue with variations in names. When you filter by author, Mendeley has no way of knowing that “Mark Pauly”, “Mark V Pauly”, “MV Pauly” are all the same person. To fix this using the interface would be a painful, manual task involving selecting and editing each individual paper.

Fortunately, Mendeley stores the local library information on a relatively easy to access SQLite database, and I know SQL. What I did, and will show, was to find probable duplicate names and merge them together using SQL code.

Step 0: BACK UP YOUR DATABASE!

The main databases are stored in local AppData folder (“C:\Users\[Your_Windows_UserName]\AppData\Local\”) under “Mendeley Ltd\Mendeley Desktop”. What you are looking for is a file similar to “dl679@cornell.edu@www.mendeley.com.sqlite” . BACK THIS FILE UP BEFORE PROCEEDING.

Step 1: Get and install a program to read and write SQLite

There are a variety of tools out there, but I used, and can recommend, SQLiteStudio (https://sqlitestudio.pl/index.rvt).

Step 2: Connect to the database.

Open up SQLStudio and add the databse you located in step 0, then connect to the database (2). Once you have it open, you should be able to see the tables that Mendeley uses. The main one is “Documents”. This table has one entry for each of the articles in your database. The authors are stored in “DocumentContributors”.

Step 3: Find suspected duplicates

The following code generates a good list of authors that are probably duplicate entries.


SELECT DocumentContributors.LastName, DocumentContributors.firstNames, count(Distinct documents.id) as Num_Papers, max(Num_Papers_LN) as Num_Papers_LN
FROM DocumentContributors
INNER JOIN documents
on DocumentContributors.DocumentID = documents.id
INNER JOIN (
   SELECT LastName, count(Distinct documents.id) as Num_Papers_LN
   FROM DocumentContributors
   INNER JOIN documents
   on DocumentContributors.DocumentID = documents.id
   GROUP BY LastName
   ) as ln
on ln.lastName = DocumentContributors.lastName
GROUP BY DocumentContributors.firstNAmes, DocumentContributors.LastNAme
HAVING Num_Papers_LN > 8
ORDER BY Num_Papers_LN desc, DocumentContributors.LastNAme, Num_Papers desc;

I limited my list to authors (by last names) with more than 8 papers, as I did not want to spend too much time cleaning up smaller authors.

Step 4: Identify and update duplicates

Once I have my list I check to see if there are more than one author with the last name. The idea is to create a search string that uniquely identifies an author, than standardize the first name. For example, I searched for:


SELECT *
FROM DocumentContributors
WHERE lastName = 'Town'
and firstNames like "%";

In my database, all of the authors with a last name “Town” were in fact “Robert J Town”, so I could safely run the following update statement to standard the first name. I chose to omit periods, and use the first initial of the middle name, but your standardization procedure may differ:


UPDATE DocumentContributors
SET firstNAmes = "Robert J"
WHERE lastName = 'Town'
and firstNames like "%";

If there are multiple authors with the same last name, you can include a first initial before the % to filter out based on that. Always check to see what is going to be impacted by your query before running an update statement.

Step 5: Make Mendeley update the search index

Finally, to get Mendeley to update the search index (including the index of authors), with Mendeley closed, delete the files in “…\AppData\Local\Mendeley Ltd\Mendeley Desktop\www.mendeley.com\dl679@cornell.edu-3ab2” (your final subfolder will differ). BACKUP THESE FILES FIRST. Deleting (and having Mendeley rebuild) this file could very well speed up search if it has become slow.

One caveat, this approach will not update the information in your library on Mendeley.com, and will not sync to other computers. I think this can be accomplished by updating the “eventLog” and “eventAttributes” tables, but I didn’t have the time to write up a sufficiently automated process, but think something could be done fairly easily using Python.

Step 6: See which authors you read the most


SELECT DocumentContributors.LastName, DocumentContributors.firstNames, count(Distinct documents.id) as Num_Papers
FROM DocumentContributors
INNER JOIN documents
on DocumentContributors.DocumentID = documents.id
on ln.lastName = DocumentContributors.lastName
GROUP BY DocumentContributors.firstNAmes, DocumentContributors.LastNAme
HAVING Num_Papers > 5
ORDER BY Num_Papers desc;

A side benefit of this project is that I can see which authors are particularly important to my research by running the above code. Not surprisingly, my advisor’s advisor, Marty Gaynor, tops the list (along with the very prolific Larry Casalino). Amitabh Chandra, David Dranove, Bruce Landon and Robert Town are next up (can you tell I research health economics!?).

I also used a similar process to clean up the journal names, and then checked which journals I read most often. Health affairs topped the list followed by NBER Working Papers. There is a big gap between the next couple: Journal of Health Economics, Health Services Research, Journal of Economic Perspectives and American Economic Review.


Leave a Reply