Important Python Packages
Apparently, PyPI keeps track of Python package downloads and, moreover, publishes those numbers in a BigQuery table. In other words, we can easily see the most downloaded Python packages and, with a little bit of work, their dependencies. These packages and their direct dependencies form a natural directed graph that we can explore using networkx, a fairly popular Python package for network analysis.
The first step is getting the list of packages and their corresponding download counts for the last 3 months from PyPI. For the sake of simplicity, I will just use the top 1000 packages for the analysis:
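A sketch of how that query might look with the `google-cloud-bigquery` client and the public `bigquery-public-data.pypi.file_downloads` table (the `downloads.csv` name is just a placeholder):

```python
from google.cloud import bigquery

client = bigquery.Client()

# count downloads per package over the last 90 days,
# keeping only the 1000 most downloaded packages
query = """
    SELECT file.project AS package, COUNT(*) AS downloads
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE DATE(timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
    GROUP BY package
    ORDER BY downloads DESC
    LIMIT 1000
"""
client.query(query).to_dataframe().to_csv("downloads.csv", index=False)
```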
which results in the following csv file:
Hmm, `botocore` was the most downloaded package in the last 3 months.
Anyway, what we would like to do is find the dependencies of these packages, which we will later use to form a graph for analysis. Basically, download and install each package in a virtual environment, find its direct dependencies and uninstall it:
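A rough sketch of such a script (the file names and the CSV layout are my assumptions; it shells out to pip and scrapes the `Requires:` line of `pip show` for the direct dependencies):

```python
import subprocess
import sys

import pandas as pd

packages = pd.read_csv("downloads.csv")["package"]

with open("dependencies.csv", "w") as out:
    for package in packages:
        # install the package itself (skip it if the install fails)
        result = subprocess.run([sys.executable, "-m", "pip", "install", package])
        if result.returncode != 0:
            continue
        # `pip show` lists the direct dependencies on its "Requires:" line
        info = subprocess.run(
            [sys.executable, "-m", "pip", "show", package],
            capture_output=True, text=True,
        ).stdout
        requires = [
            line.split(":", 1)[1].strip()
            for line in info.splitlines()
            if line.startswith("Requires:")
        ][0]
        # write one line per package: "package,dep1 dep2 dep3"
        out.write(f"{package},{requires.replace(', ', ' ')}\n")
        subprocess.run([sys.executable, "-m", "pip", "uninstall", "-y", package])
```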
which results in a file similar to below:
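Assuming the hypothetical layout from the sketch above, the first line would look something like:

```
botocore,urllib3 jmespath python-dateutil
```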
The first line of that file basically says that `botocore` depends on the `urllib3`, `jmespath` and `python-dateutil` packages. Unfortunately, I couldn't install all 1000 packages because some required Windows, others didn't support Python 3.9 (I am looking at you tensorflow, well, sub-packages of tensorflow), etc., and I didn't want to spend too much time trying to get all 1000 packages installed.
Another thing to keep in mind is that we are disregarding any optional dependencies. In other words, to take Dask as an example, we are only considering the core dependencies of Dask, as we are running `pip install dask` rather than `pip install dask[complete]` in the above script.
Let’s unstack the dependencies:
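One way to do the unstacking with pandas, assuming the `dependencies.csv` layout sketched above:

```python
import pandas as pd

deps = pd.read_csv("dependencies.csv", names=["package", "requires"])

# split the space-separated dependency list and emit one row per edge
edges = (
    deps.assign(requires=deps["requires"].str.split(" "))
        .explode("requires")
        .dropna()
)
edges.to_csv("edges.csv", index=False)
```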
to get the following file:
so that edges are obvious and easy to import. Finally, we can form our graph for analysis:
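For example, with `nx.from_pandas_edgelist` (assuming the `edges.csv` file from the previous step):

```python
import networkx as nx
import pandas as pd

edges = pd.read_csv("edges.csv")

# a directed edge points from a package to one of its dependencies
G = nx.from_pandas_edgelist(
    edges, source="package", target="requires", create_using=nx.DiGraph
)
```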
Yay, finally some numbers:
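A sketch of the kind of numbers meant here:

```python
print(G.number_of_nodes(), "nodes and", G.number_of_edges(), "edges")
print("density:", nx.density(G))
print("isolates:", list(nx.isolates(G)))
```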
That low density number is pretty normal for this type of graph afaik.
Good. There are no isolates, i.e. nodes without any edges.
Centralities
One of the goals of network analysis is to identify outstanding actors. Traditionally, this is measured via centrality. As we can measure importance of a node in several different ways - number of edges, distance to all other nodes etc - there are several centrality measures (several hundred to be more precise). We can do a centrality analysis by calculating different centralities for our graph and comparing them:
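A sketch of the calculation, collecting all eight measures into a dataframe (`max_iter` is raised only to help eigenvector centrality converge):

```python
import networkx as nx
import pandas as pd

# G is the dependency graph built above
hubs, authorities = nx.hits(G)
n = len(G)

centralities = pd.DataFrame({
    "degree": nx.degree_centrality(G),
    "closeness": nx.closeness_centrality(G),
    # harmonic closeness is not normalized, so divide by n - 1
    "harmonic": {k: v / (n - 1) for k, v in nx.harmonic_centrality(G).items()},
    "betweenness": nx.betweenness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
    "pagerank": nx.pagerank(G),
    "hubs": hubs,
    "authorities": authorities,
})
```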
Each centrality measure (except `nx.hits`) returns a dictionary with nodes as keys and centralities as values; `nx.hits` returns a tuple of two dictionaries (hubs and authorities). We also normalize harmonic closeness centrality as it is the only measure on the list that is not automatically normalized.
`centralities.corr()` calculates all pairwise correlations between the centralities and returns an 8×8 symmetric dataframe. More than half of the values in the dataframe are redundant, so we use `np.tri()` to generate a lower-left triangle of 1's and mask the duplicates by multiplying the dataframe with the mask. Finally, we put the values in a `pd.Series`, sort them and display the last 5 rows:
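A sketch of that masking trick:

```python
import numpy as np

corr = centralities.corr()

# lower-left triangle of 1's; k=-1 drops the diagonal as well
mask = np.tri(len(corr), k=-1)

# stack() flattens the masked matrix into a pd.Series
(corr * mask).stack().sort_values().tail(5)
```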
Centrality measures basically form 4 groups in our graph. One group consists of closeness, harmonic closeness, pagerank, degree and authorities, while eigenvector, hubs and betweenness each form their own individual group.
PageRank was developed by Google (and named after Larry Page) to rank web pages. The rank of a node in the network is calculated as the probability that a person randomly traversing the edges will arrive at the node. The node with the highest PageRank is the most attractive: no matter where you start, this node is the most likely destination. In our graph, `six` is the package with the highest PageRank.
The betweenness centrality measures the fraction of all possible shortest paths, geodesics to be more precise, that pass through a node. If the betweenness is high, the node is potentially a crucial go-between, and the removal of such a node possibly splits the network into disconnected parts. In our graph, `requests` is the package with the highest betweenness centrality.
Eigenvector centrality has a somewhat recursive definition. Basically, it says “tell me who your friends are and I will tell you who you are”. High eigenvector centrality identifies nodes surrounded by other nodes with high eigenvector centrality, and it can be used to locate groups of nodes with high prestige. In our graph, `pycparser` is the node with the highest eigenvector centrality.
We can easily check the correlations visually as well:
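For example, with pandas' built-in scatter matrix (just one of several ways to plot this):

```python
import matplotlib.pyplot as plt
import pandas as pd

pd.plotting.scatter_matrix(centralities, figsize=(12, 12))
plt.show()
```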
which results in the following graph:
There does not seem to be much of a relationship as expected from the correlation numbers.
Communities
In a directed graph, strongly connected components are the maximal set of nodes such that for every pair of nodes u and v, there is a directed path from u to v and from v to u.
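Counting them is a one-liner in networkx:

```python
print(nx.number_strongly_connected_components(G))
```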
927 different components! But this is somewhat expected, as we do not want circular dependencies between our packages (A depends on B, which depends on C, which depends on A), as that would indeed make it somewhat of a challenge to install the relevant packages.
Weakly connected components are similar but we can traverse the links in the opposite direction as well, i.e. we can treat the links as undirected.
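Again a one-liner:

```python
print(nx.number_weakly_connected_components(G))
```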
In other words, we have 13 different weakly connected components in our graph.
The problem of finding groups of nodes in networks is called community detection. Community detection is a challenging task - optimal partitioning of a graph is NP-complete - but approximate solutions do a pretty good job. Networkx provides 2 main community detection algorithms:
The second algorithm is the Girvan–Newman method. The Girvan–Newman algorithm detects communities by progressively removing edges from the original network; the connected components of the remaining network are the communities. Instead of trying to construct a measure that tells us which edges are the most central to communities, the Girvan–Newman algorithm focuses on the edges that are most likely “between” communities. As the graph breaks down into pieces, the tightly knit community structure is exposed, and the result can be depicted as a dendrogram.
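A minimal sketch of running it (`girvan_newman` returns an iterator over the levels of the dendrogram, so we only peek at the first few):

```python
import itertools

from networkx.algorithms.community import girvan_newman

# each item is a tuple of communities at one level of the dendrogram
for communities in itertools.islice(girvan_newman(G), 3):
    print(len(communities), "communities:", [len(c) for c in communities])
```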
We can use networkx’s `draw_networkx` method to visualize our communities. However, Gephi provides much better visualizations, so we will use that instead.
Each community is a different color. Lines represent dependencies, and the size of a package’s name is proportional to its centrality. The upper-right purple community is Microsoft Azure packages, for example, and the lower-leftish orange community mostly consists of data science packages.
`six` and `requests` (and `setuptools`) seem to be the most important packages in our graph.
One final point:
Our graph is a DAG, as expected. That also means we can topologically sort our nodes. Topological sorting is a linear ordering of vertices such that for every directed edge from u to v, node u comes before v in the ordering. Printing the first and last 10 for our graph, we get:
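A sketch, assuming edges point from a package to its dependencies so that dependents come first:

```python
order = list(nx.topological_sort(G))
print(order[:10])
print(order[-10:])
```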
We can think of the first elements (the list starting with `awscli` in the above example) as frontends: they are the packages meant for human interaction, and other packages do not really depend on them. Correspondingly, the second list (the list starting with `pyjwt` above) can be considered backends: those are the packages that mostly other packages depend on.