Teaming Data –> RootNPI

The DocGraph / CareSet team does great work and I have personally benefited from the availability of their original CMS teaming data, even using it in a chapter from my dissertation.

They recently updated their methodology and created a new group of datasets they call “root NPI“. Along with this update, they will no longer be updating the original format teaming data. While I understand the need for this change, the fact that they have neither updated the original data, nor retroactively created the new RootNPI data (beyond 2014) is a problem for me as I use the time variation in these datasets and would like to be able to add years.

To get around this limitation I created a method, to a fairly close approximation, creates the new data sets from the old, and therefore allows me to perform analysis on data from 2009 to 2015. The idea is to take the 180 day files and make them symmetrical. My commented SAS code is here, but the main commands are:

/* Duplicate the teaming data, switching NPI_Number1 and NPI_Number2 */
DATA phy_ref_2014_180_x2;
SET ref_med.phy_ref_2014_180(rename=(NPI_Number2=NPI_A NPI_Number1=NPI_B))
   ref_med.phy_ref_2014_180(rename=(NPI_Number1=NPI_A NPI_Number2=NPI_B));

/* Choose the NPI pair with the most patients */
proc sql;
CREATE TABLE npiroot_2014_180
   , MAX(Bene_Count) as patient_count
FROM phy_ref_2014_180_x2
; quit;

Complete data exist from both data sets for 2014, which allows me to compare the effectiveness of my transformation method. Here are some statistics from the comparison:

  • For the pairs that match, there is a very high correlation between the two (0.97938, see scatter below)
  • While 37.5% of the pairs do not match (50m of 185m), these pairs only account for ~10% of the total number of shared patient connections (~800m of ~7.5B)
  • It looks like most of the missing connections happen right near the 11 cutoff.
    • In fact, for 30% of missing pairs the present pair has a patient count of 11
    • It is 19% for 12, 13% for 13, 9% for 14
    • 90% have fewer than 22, 95% <33, and more than 99% fewer than 100
  • There seems to be a decently large, non-random set of providers that are in the new data, but not in the old. They all seem to be medical device related. Here are the top 10: Arriva Medical, Degc Enterprises, Lincare, All American Medical Supplies, United States Medical Supply, Med-Care Diabetic & Medical Supplies, Ocean Home Health Supply, Binson’s Hospital Supplies, Passaic Healthcare Services, DJO
  • There is not a similar pattern for providers in the old data, but not in the new. For consistency across years, I will probably exclude the above set of providers from my analyses.
  • Here is a comparison of the two datasets in terms of number of patients (the strength of the connection (RootNPI vs Constructed)
    • Median: 21 and 21
    • Mean: 41.6 vs 43.4
    • 95 percentile: 128 vs 134
    • 99 percentile: 347 vs 378
    • Standard Deviation: 104.5 vs 116.6
  • It may seem odd that the constructed data set has a larger average than the new, Root NPI data, since the new one is using the full year to define a connection, while the old data set used the 180 day window. I think what accounts for the discrepency is the fact that the old data set included connections that happen 6 months after Dec 2014, which the RootNPI omits.

Finally, to ensure that there is not any odd systematic variations between the two measure, I created a scatter plot comparing the patient count I calculated from the original teaming data with the new RootNPI patient count. I truncated the plot at 1000, both because there are just too many obs <1000 and because I am mainly interested in what happens to the relationship as both get large.

To me, this looks really reassuring. The two measures seem to be similar, with some noise, and this noise appears to get smaller as the number of patients gets larger.

Mendeley – Fixing Author and Journal Metadata

I use Mendeley to store, organize and manage my library of academic papers. It’s tagging and search features are excellent, and Mendeley helps keep the process of finding previously read literature manageable. At times I hear or see an author’s name and wonder what papers of theirs I have previously read. In Mendeley, you can select and view all of the papers by a single author quite easily. However, there is an issue with variations in names. When you filter by author, Mendeley has no way of knowing that “Mark Pauly”, “Mark V Pauly”, “MV Pauly” are all the same person. To fix this using the interface would be a painful, manual task involving selecting and editing each individual paper.

Fortunately, Mendeley stores the local library information on a relatively easy to access SQLite database, and I know SQL. What I did, and will show, was to find probable duplicate names and merge them together using SQL code.


The main databases are stored in local AppData folder (“C:\Users\[Your_Windows_UserName]\AppData\Local\”) under “Mendeley Ltd\Mendeley Desktop”. What you are looking for is a file similar to “” . BACK THIS FILE UP BEFORE PROCEEDING.

Step 1: Get and install a program to read and write SQLite

There are a variety of tools out there, but I used, and can recommend, SQLiteStudio (

Step 2: Connect to the database.

Open up SQLStudio and add the databse you located in step 0, then connect to the database (2). Once you have it open, you should be able to see the tables that Mendeley uses. The main one is “Documents”. This table has one entry for each of the articles in your database. The authors are stored in “DocumentContributors”.

Step 3: Find suspected duplicates

The following code generates a good list of authors that are probably duplicate entries.

SELECT DocumentContributors.LastName, DocumentContributors.firstNames, count(Distinct as Num_Papers, max(Num_Papers_LN) as Num_Papers_LN
FROM DocumentContributors
INNER JOIN documents
on DocumentContributors.DocumentID =
   SELECT LastName, count(Distinct as Num_Papers_LN
   FROM DocumentContributors
   INNER JOIN documents
   on DocumentContributors.DocumentID =
   GROUP BY LastName
   ) as ln
on ln.lastName = DocumentContributors.lastName
GROUP BY DocumentContributors.firstNAmes, DocumentContributors.LastNAme
HAVING Num_Papers_LN > 8
ORDER BY Num_Papers_LN desc, DocumentContributors.LastNAme, Num_Papers desc;

I limited my list to authors (by last names) with more than 8 papers, as I did not want to spend too much time cleaning up smaller authors.

Step 4: Identify and update duplicates

Once I have my list I check to see if there are more than one author with the last name. The idea is to create a search string that uniquely identifies an author, than standardize the first name. For example, I searched for:

FROM DocumentContributors
WHERE lastName = 'Town'
and firstNames like "%";

In my database, all of the authors with a last name “Town” were in fact “Robert J Town”, so I could safely run the following update statement to standard the first name. I chose to omit periods, and use the first initial of the middle name, but your standardization procedure may differ:

UPDATE DocumentContributors
SET firstNAmes = "Robert J"
WHERE lastName = 'Town'
and firstNames like "%";

If there are multiple authors with the same last name, you can include a first initial before the % to filter out based on that. Always check to see what is going to be impacted by your query before running an update statement.

Step 5: Make Mendeley update the search index

Finally, to get Mendeley to update the search index (including the index of authors), with Mendeley closed, delete the files in “…\AppData\Local\Mendeley Ltd\Mendeley Desktop\\” (your final subfolder will differ). BACKUP THESE FILES FIRST. Deleting (and having Mendeley rebuild) this file could very well speed up search if it has become slow.

One caveat, this approach will not update the information in your library on, and will not sync to other computers. I think this can be accomplished by updating the “eventLog” and “eventAttributes” tables, but I didn’t have the time to write up a sufficiently automated process, but think something could be done fairly easily using Python.

Step 6: See which authors you read the most

SELECT DocumentContributors.LastName, DocumentContributors.firstNames, count(Distinct as Num_Papers
FROM DocumentContributors
INNER JOIN documents
on DocumentContributors.DocumentID =
on ln.lastName = DocumentContributors.lastName
GROUP BY DocumentContributors.firstNAmes, DocumentContributors.LastNAme
HAVING Num_Papers > 5
ORDER BY Num_Papers desc;

A side benefit of this project is that I can see which authors are particularly important to my research by running the above code. Not surprisingly, my advisor’s advisor, Marty Gaynor, tops the list (along with the very prolific Larry Casalino). Amitabh Chandra, David Dranove, Bruce Landon and Robert Town are next up (can you tell I research health economics!?).

I also used a similar process to clean up the journal names, and then checked which journals I read most often. Health affairs topped the list followed by NBER Working Papers. There is a big gap between the next couple: Journal of Health Economics, Health Services Research, Journal of Economic Perspectives and American Economic Review.

Olympics 2018: Medals Recap

Some thoughts on the results of the 2018 Olympics:

Most pundits agree that this was a disappointing for Team USA. The haul of 23 medals is 5 fewer than 2014 and 14 down from 2010. The USOC had set a target of 37, with the expectation of at least 25, and the hope of up to 59. This decline is total medals is more severe when you consider that many of the medals were in US friendly events that previously did not exist (11 from snowy pursuits snowboarding and freestyle skiing). The USOC walked away with the same number of gold medals they have received since 2006 – 9, with 2002 only being one higher.

However, despite the shortfall in total medals, Team USA did have some notable victories: First, the women’s hockey team. As someone who attended two schools where hockey is the major sport (Colgate and Cornell… go Colgate!) I really enjoy hockey. I watched the utter heartbreak of the US women’s 2014 loss, made extra difficult by the fact that in the Olympics there is no “get ’em next year”. It’s get them in 4 years… This year the gold medal game lived up to the hype, including the final outcome. Second: the improbable victory in curling beating out both Canada and Sweden (which is another recent event predicted by the Simpsons more).

Here are some assorted notables:

  • Norway, Germany and Canada had great Olympics.
  • For Germany, it is return to form, after a bit of a slide from 2002-2014.
  • For Canada, it demonstrates that 2010 was not just a fluke and their rise to prominence is likely here to stay (unexpected losses in both Men and Women’s hockey, and Curling aside).
  • A historical note: It is crazy to think that in 1988 the sum total of golds for Canada, Norway and the USA was 2 (4% of the total). Lately it’s been around 33% (including this Olympics)
  • Russia/OAR dropped down to 2 golds. Some of it was undeniably the ban, but their haul of 13 in Sochi was a bit of an anomaly. They took home 3 in 2010 Vancouver
  • South Korea did not seem to receive much of a hosting bump. Their gold medal total was between their 2014 and 2010 count, and their total medal count has been steadily increasing since 2002.

Here is graph of the total medal count over time:

And here is the gold medal count over time:

1. The other prediction being the Trump presidency, which they predict will be followed by Lisa Simpson. Interestingly, Ted Cruz just argued that Lisa is a democrat.

Fantasy Playoff Probability

I crunched the numbers for my fantasy football league (methodology details below) and here are the results:

Team (current record) E(Wins) E(Rank) Playoffs 1st Round Bye
Cowboys (8-2) 10.19 1.27 100.0% 93.4%
Jets (7-3) 9.00 2.82 98.7% 61.3%
Vikings (6-4) 8.12 3.21 97.9% 6.1%
Pack (6-4) 7.78 4.34 89.4% 30.6%
Fourth (6-4) 7.39 5.51 74.9% 8.1%
Economists (5-5) 6.77 6.09 59.5% 0.1%
Eiferts (5-5) 6.61 6.08 59.0% 0.4%
Broncos (6-4) 6.59 7.41 17.2% 0.0%
NotGonnaLie (4-6) 4.73 9.64 3.2% 0.0%
Giants (3-7) 3.58 10.60 0.2% 0.0%
Wonders (2-8) 3.82 9.22 0.0% 0.0%
Blue (2-8) 2.54 11.80 0.0% 0.0%

The probability distribution of each team and their rank (grouped by tier) follows:Next week the Economists will play the Eiferts in what will almost be a playoff game. The winner has around a 90% probability of making the playoffs, while the loser has around a 30% probability.Even though the Broncos have a game on both the Economists (me) and the Eiferts, the algorithm is pretty bearish on the Bronco’s chances. Part of that is their 20 points below me (and 120 below the Eiferts) in the tiebreaker, but part of it is their more difficult schedule.Cowboys are definitely in. Jets and Vikings are almost surely in. Blue is almost surely out. The real battle is for those last two spots (5+6).

Top Tier 2017 11 16

Middle Crew 2017 11 16

Bottom Rung 2017 11 16



P.S. Methodology may follow in an update. Short version: used a version of Pythagorean expected wins to compute win probabilities (exponent of 6) – this seems to be similar to what Yahoo uses in their projections. Had to “guess” at some lineups since some teams have players on byes in future weeks. Then just computed the probabilities for each of the 262,144 possible win / loss outcomes (2^6=64 outcomes per week for three weeks). Also, had to make assumptions about points scored for the tiebreakers (gave the winner the greater of the two expectations).

Olympics 2016 Rio

It’s become a bit of a biannual tradition for me to write an about the Olympics – specifically the distribution of medals between countries.

I don’t have any sport by sport insight into how these games will look, but judging from the last couple years this should be another showdown between the US and China over Olympic supremacy. In three of the last four Olympics the United States has edged out China with China’s lone victory coming as a big win when they hosted in 2008.

Other medal count story lines to watch:

  • Can Britain build on their impressive growth, or was 2012 only about the home country bounce?
  • How will Russia do? Historically they were a powerhouse, but the breakup of the Soviet Union and the struggles of their economy let to a prolonged slump. Their economy has improved and they have seemed determined to reassert themselves on the world stage. Will that show up in their medal count? As a result of a huge doing scandal some of their athletes were banned. How much of a factor will that  be?
  • How will Brazil do? There’s been a lot of press coverage about how Rio want ready to host, but is Brazil ready to compete?
  • Will Japan see a spike add they prepare to host in 2020?

A new type of graph

[DRAFT] Introducing the Dan Chart (ok, that name won’t fly. The Dart? Darts fly!), or to name it descriptively, the aligned stacked bar chart. i have had trouble representing data where there are both a lot of categories (series), and a lot of periods (or values). The clustered column stops being useful around 5×5. The typical solution has been to use either the stacked column, or the 100% stacked column. Both of these have serious drawbacks and an easy solution.

First, observe how unintelligible the clustered column chart is:

The stacked bar char shows how the total changes over time, and tries to give you an idea of how the composition of that total changes. Even though changes in an individual category are represented in the chart, ease of interpretation depends on the order of the categories. Changes over time will be very easy to observe for the first/bottom category. You can easily see, for example that for category 1 October is slightly higher than May (0.6 vs 0.53). However, it’s takes a lot more effort to compare Oct to May for category even though the difference is bigger (0.56 to 0.39). If I asked you which category grew more from Oct to May, you’d be pretty hard pressed to answer that (without the numbers).

The 100% stacked bar chart ignores changes over time and only focuses on the change in composition. This is useful for when those changes either don’t matter, or are very small. It gives you a clearer picture for how composition changes. However, it still has the same problem that it is hard to get insights about the changes for those middle categories. It gets unwieldy around 4 or 5 categories. Consider the following example. The total data series is stationary (always sums to 1). You can clearly see that category 10 increased a lot and category 1 decreased. It’s harder to see what is happening to the interior categories, like category 5. It looks like it is staying pretty much the same. Can you notice the consistent decrease from July to Dec (goes monotonically from 8.7% to 5.6%).

These changes are hard to spot because each series is not aligned with itself. What the Dart (ok, fine. the aligned stacked bar chart) does is put white space between each series so that each series is aligned to itself. In the examples I centered aligned them, but they could just as easily chosen to align them to the bottom or the top. With this format you can easily look across the series to see how it changes over time, and make those comparisons. You also can look at a particular time period and see the composition of the total. It works with both with data that changes over time, and data that is stationary. I am not sure why we don’t see this type of graph! I’ve created a hack to make it work in MS excel (contact me if you’re interested) but hopefully this could be adapted into the program to improve chart readability!

Winter Olympics Medals Over Time – Post Olympics Update

The Winter Olympics have now closed. The host nation Russia walked away with both the most medals and the most golds. The increase was pretty dramatic. Russia only managed to pick up 3 gold medals in Vancouver but napped 13 in Sochi. Canada, while lacking the home country bounce they had in 2010, continued to be a top competitor. Norway had another good games reinforcing the fact that their meager haul in Turin 2006 was just an aberration.

A lot of pundits have called this year’s games a huge disappointment for the USA (ESPN: Team USA disappoints in Sochi)However, by historical standards the 2014 games were pretty standard. The USA pulled in the same number of gold medals as 2010, 2006 and only one down from our all time high in 2002 (which we hosted). In both the Summer and Winter Olympics consistency seems to be the name of the game for the United States as I pointed out in my first summer Olympics post. Though, as I mentioned before, the number of events and medals have been increasing so in terms of the percent of gold medals the United States is slipping.

Winter Olympic Gold Medal Counts 1988-2

The total medal counts show pretty much the same story. The United States failed to defend their total medal lead – however, hopefully this time it won’t take us 78 years to be back on top! As with the Summer Olympics it appears that the home-country bounce is more pronounced in the count of gold medals rather than number of total medals. To me this is counter-intuitive and is begging for some good statistical analysis to investigate potential systematic judging bias in favor of the home country. Perhaps if I find myself with some extra time (highly unlikely) I can look into that.

Winter Olympic Total Medal Counts 1988-2

Unlike host country Russia, Germany continued to struggle to regain their passed Winter Olympics glory. They took home the fewest gold medals since 1972 (combining East and West) and fewest total medals since 1968! (See the German/Russian dominance in my original Winter Olympics post)

Probably the most surprising country was the Netherlands. Because historically they were not a big player, I did not even include them in my charts. At one time they led the total medal count and finished with 24 medals and 8 golds. Previously the Netherlands’ highest haul had been 11 medals and 5 golds  (1998) and in 2010 they took home a total of 8 medals.