The Strength of Structural Diversity in Online Social Networks

Understanding the way individuals are interconnected in social networks is of prime significance to predict their collective outcomes. Leveraging a large-scale dataset from a knowledge-sharing website, this paper presents an exploratory investigation of the way to depict structural diversity in directed networks and how it can be utilized to predict one's online social reputation. To capture the structural diversity of an individual, we first consider the number of weakly and strongly connected components in one's contact neighborhood and further take the coexposure network of social neighbors into consideration. We show empirical evidence that the structural diversity of an individual is able to provide valuable insights to predict personal online social reputation, and the inclusion of a coexposure network provides an additional ingredient to achieve that goal. After synthetically controlling several possible confounding factors through matching experiments, structural diversity still plays a nonnegligible role in the prediction of personal online social reputation. Our work constitutes one of the first attempts to empirically study structural diversity in directed networks and has practical implications for a range of domains, such as social influence and collective intelligence studies.


Introduction
Recent years have witnessed the emergence and rapid proliferation of many social applications and media platforms. As the backbone of so many online social systems, network structure is becoming a complex and subtle force that drives the dynamics of a wide variety of social processes. In some cases, we seek to leverage social networks to maintain social capital, encourage the adoption of new products, or promote positive behaviors like cooperation and physical exercise [1][2][3][4][5][6][7][8][9][10], while in others to eliminate the spread of infectious diseases and fake news or change negative behaviors like conflict and unhealthy eating [11][12][13][14][15].
However, the information redundancy and prevalence of similarity in one's social realm may limit his/her potential of exposure to diverse information and interaction with people from different backgrounds, thereby reducing the efficiency of social networks, preventing the diffusion of innovative ideas, weakening the power of social influence and undermining the wisdom of crowds [29][30][31][32][33].
The recent availability of vast and fine-grained data of human activities in online social networks provides an unprecedented opportunity to investigate the nuanced or subtle social effects induced by social context diversity. Structural diversity, which is aimed at quantifying the diversity of one's social context from the view of component count among social neighbors, has been shown useful in predicting specific social processes in undirected networks [2,6], while for directed networks, the strength of structural diversity still remains an open question [34]. In directed networks, the network connectivity patterns become more complicated, and network ties may function quite differently according to the direction [28,35]. Moreover, exposure to similar users could have a better chance to result in the similarity between two individuals even without an explicit social tie between them [36], which is largely overlooked by previous studies on structural diversity.
Online social reputation is the consensus public opinion of an individual or entity based on the ratings from members in a social network [37][38][39]. It helps to foster trust among online users and is one of the most valuable assets in our online social lives [40,41]. Using data from an online knowledge-sharing platform, this paper addresses the problem of how to quantify the structural diversity of an individual in directed networks and its role in empirically predicting personal social reputation. We first consider the weak and strong connectivity patterns among one's social contacts and find that personal social reputation is positively correlated with the number of weakly or strongly connected components in one's contact neighborhood. Conditional on how many followers one has, individuals whose followers come from more diverse social backgrounds tend to have higher social reputations on the platform. Regression analysis indicates that structural diversity measures via weak and strong connectivity among one's neighbors yield better predictions of personal social reputation than the number of followers one has. We further take the coexposure network of one's social neighbors into consideration, which goes beyond the sheer number of social contacts or connected social components in one's contact neighborhood and provides an additional ingredient to predict personal social reputation.
To separate the effect of structural diversity from that of confounders in the prediction of personal social reputation, we conduct a series of matching experiments. After synthetically controlling possible confounding factors, we present compelling evidence that for individuals with an equal number of followers, same gender, and similar activity-related patterns, those whose followers come from more diverse social backgrounds (measured by structural diversity) are more likely to have higher social reputations. Our work presents one of the first attempts to empirically study structural diversity in directed networks and demonstrates the potential utility of structural diversity in predicting personal social outcomes. It could also offer insights into the study of a range of social processes such as the diffusion of innovations, the spread of infectious diseases, and the influence maximization problem.

Results
Network Data and Social Reputation Index.
We collect data from Zhihu, a Chinese knowledge-sharing website which claims to have more than 200 million registered users. On this platform, social ties are created when users choose to follow other accounts. Starting from a randomly selected user, we collect follower and followee lists of 234,834 users in a snowball sampling manner. These 234,834 users are hereafter referred to as ego users since we know their complete followers and followees. In addition, we also collect the followee lists of the ego users' followers (user accounts two hops away from the ego users). The whole social network is then constructed based on the explicit social ties between user accounts. In total, the constructed network covers more than 10 million user accounts and 300 million directed social ties. For these ~230,000 ego users, we further collect their popularity data, including how many upvotes, thanks, and favorites they have received, which indicate their social reputations on the platform. Other kinds of informative data are also collected, such as how many questions they have asked and answered, self-reported gender, followed topics, and questions. Note that all the collected data are based on public information on the platform and do not include any users with privacy restrictions.
The distributions of the number of received upvotes, thanks, and favorites are illustrated in Figure 1, where each point in the panel indicates the fraction or relative frequency P(k) of users with a specific quantity which equals k. The dashed grey line in each panel shows the power-law fitting [42,43] for each distribution, where the power-law exponents are 1.342, 1.473, and 1.400, respectively. As the distributions span several orders of magnitude, the inequality of personal popularity is very striking. Considering the fact that these three popularity measures are highly correlated and very sparse, we adopt nonnegative matrix factorization (NMF) [44][45][46][47], a widely used dimensionality reduction technique, to collapse them into a single measure which we term the social reputation index (see Methods for details on the construction of the social reputation index).

Weak and Strong Connectivity.
To quantify the structural diversity of a given node in directed networks, we first consider the weak and strong connectivity between the neighboring nodes. Note that for any two nodes u and v in a directed network, u and v are said to be weakly connected as long as there is a path linking u and v regardless of the direction of the path and strongly connected if and only if there is at least a directed path from u to v as well as a directed path from v to u. Therefore, according to the weak or strong connectivity patterns between nodes, a directed network can be decomposed into several social components.
There may be multiple ways to quantify the structural diversity of individuals, but the simplest and most straightforward is the number of connected social components in one's contact neighborhood [2,34]. Taking networks with three nodes as examples, Figure 2(a) shows how directed networks are decomposed into different numbers of social components based on the weak or strong connectivity patterns. For a constructed ego network with the ego node/user (we use the terms node and user interchangeably throughout the paper) located at the hub of the wheel, Figure 2(b) further illustrates how the corresponding structural diversity measures are computed compared with indegree (number of followers). For the given ego user in Figure 2(b), (i) indegree is equal to the number of followers, which is 9 in the given example; (ii) the weak diversity measure is equal to the number of weakly connected components in the contact neighborhood of the ego user, which is 4 in the given example; and (iii) the strong diversity measure is equal to the number of strongly connected components in the contact neighborhood, which is 6 in the given example. In fact, weak diversity measure ≤ strong diversity measure ≤ indegree holds for any given ego user, because every strongly connected component lies within a single weakly connected component and every component contains at least one follower. For these ~230,000 ego users, more than 91% have fewer than 20 followers, and only 1,442 have more than 1,000 followers (Figure 2(c), c1). As shown in Figure 2(c), the distributions of their weak and strong diversity measures are also highly skewed. The dashed grey line in each panel shows the power-law fitting [42,43] for each distribution, where the power-law exponents are 1.641, 1.690, and 1.659, respectively. As expected, the social reputation index increases with all three metrics, with Spearman's rank correlation coefficients of 0.646 for indegree (p < 0.001), 0.648 for the weak diversity measure (p < 0.001), and 0.647 for the strong diversity measure (p < 0.001).
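As a minimal sketch of the three measures described above, the component counts could be computed as follows. The representation is an assumption made for illustration, not the authors' data format: the contact neighborhood is a list of follower IDs plus directed pairs (u, v) meaning "u follows v", and the function names weak_diversity and strong_diversity are hypothetical. Edges touching accounts outside the follower set (including the ego) are ignored.

```python
from collections import defaultdict


def weak_diversity(followers, edges):
    """Number of weakly connected components among the followers (union-find)."""
    parent = {u: u for u in followers}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for u, v in edges:
        if u in parent and v in parent:  # keep ties inside the neighborhood only
            parent[find(u)] = find(v)
    return len({find(u) for u in followers})


def strong_diversity(followers, edges):
    """Number of strongly connected components among the followers (Kosaraju)."""
    nodes = set(followers)
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        if u in nodes and v in nodes:
            fwd[u].append(v)
            rev[v].append(u)

    def dfs(start, adj, seen, out):
        # Iterative depth-first search that records nodes in post-order.
        stack = [(start, iter(adj[start]))]
        seen.add(start)
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:
                stack.pop()
                out.append(node)

    order, seen = [], set()
    for u in nodes:
        if u not in seen:
            dfs(u, fwd, seen, order)

    seen, count = set(), 0
    for u in reversed(order):  # decreasing finishing time, on the reversed graph
        if u not in seen:
            dfs(u, rev, seen, [])
            count += 1
    return count


# Toy ego network: five followers; the tie to "ego" lies outside the neighborhood.
followers = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("d", "e"), ("a", "ego")]
indegree = len(followers)
weak = weak_diversity(followers, edges)      # components {a, b, c} and {d, e}
strong = strong_diversity(followers, edges)  # components {a, b}, {c}, {d}, {e}
```

In this toy example the ordering weak ≤ strong ≤ indegree reads 2 ≤ 4 ≤ 5.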
With a closer look at individuals with an equal number of followers (indegree), we find that the social reputation index generally grows monotonically with the increase of weak and strong diversity measures (Figure 3). In other words, for individuals with the same number of followers, those whose followers come from more diverse social backgrounds (measured by weak or strong diversity measure) are likely to have higher social reputations. Figure 4(a) summarises the prediction accuracy ($R^2$) of social reputation by indegree, weak diversity measure, and strong diversity measure through three separate ordinary least squares (OLS) regressions. In each regression, the social reputation index is set as the dependent variable and each measure (log-transformed) is set as the sole independent variable. As shown in the figure, structural diversity measures yield better predictions of online social reputation than indegree, with $R^2$ values equal to 0.691, 0.699, and 0.698 for indegree, weak diversity measure, and strong diversity measure, respectively.
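The single-predictor regressions above can be illustrated with a short sketch. For an OLS fit with one independent variable and an intercept, $R^2$ equals the squared Pearson correlation between predictor and response, which is what the hypothetical helper r_squared computes. The exact log transform applied to the predictors is not stated in the text; log10(x + 1) is assumed here, and the data are made up for illustration.

```python
import math


def log_transform(values):
    """Assumed transform log10(x + 1); the source only says 'log-transformed'."""
    return [math.log10(v + 1) for v in values]


def r_squared(x, y):
    """R^2 of the OLS fit y = a + b * x, i.e. the squared Pearson correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


# Made-up data: reputation is exactly linear in log10(indegree + 1).
indegree = [0, 9, 99, 999]
reputation = [0.0, 1.0, 2.0, 3.0]
r2_exact = r_squared(log_transform(indegree), reputation)

# A noisier made-up relationship gives a value strictly between 0 and 1.
r2_noisy = r_squared([1, 2, 3, 4], [1, 3, 2, 4])
```

Comparing such $R^2$ values across predictors, one regression per measure, mirrors the comparison summarised in Figure 4(a).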
To directly compare these measures, we further adopt $L_1$-regularized linear regression, which is also called least absolute shrinkage and selection operator (LASSO) regression. LASSO regression is a standard model in sparse regression and has been widely used for simultaneous estimation and variable selection [48][49][50][51][52] (see Methods for more information on LASSO regression). Note that, as the regularization level increases, LASSO continuously shrinks the coefficients of less important features to zero [49,50]. Figure 4(b) shows the regularization path obtained from LASSO regression where indegree, weak diversity measure, and strong diversity measure (log-transformed) are set as predictors in the prediction of social reputation. As the regularization level decreases, the weak and strong diversity measures are selected before indegree, which may imply that it is not how many followers one has but rather the structural diversity of one's followers that matters in capturing one's social reputation.

Social Bridges.
To quantify one's social context diversity from the view of component count, previous studies have focused exclusively on explicit and direct social ties in one's contact neighborhood. However, as social neighbors provide a reliable way to infer personal characteristics [24,27,28][53][54][55], it is reasonable to speculate that two individuals who are exposed to a similar set of users will tend to be similar to each other, even without a direct social tie between them. Therefore, the common followees between them could act as social bridges that implicitly "link" two unconnected individuals or social components in the network [36]. Figure 5(a) illustrates the process of how we apply social bridges to capture the implicit structural diversity of ego nodes. Figure 5(b) gives another example with all the followers (denoted by black circles) isolated (i.e., no direct social ties between the followers). However, after social bridges are taken into consideration, user a and user b are likely to form one connected component (see Figures S3 and S4 of the Supplementary Materials for more examples). Specifically, for two followers i and j of an ego user, we adopt Jaccard similarity of their followee sets to determine whether there is a "bridged connection" between them: $\mathrm{JaccardSim}(i, j) = |F_i \cap F_j| / |F_i \cup F_j|$, where $F_i$ and $F_j$ denote the followee sets of i and j, respectively; $\cap$ and $\cup$ denote the intersection and union operators of two sets, respectively; and $|\cdot|$ denotes the size of a given set. A bridged connection exists between i and j when $\mathrm{JaccardSim}(i, j)$ is larger than a given threshold.
A small threshold indicates that a bridged connection exists between two individuals as long as they share a small fraction of followees, while a large threshold requires two individuals to share a large proportion of followees before a bridged connection is deemed to exist. Note that a bridged connection is treated as meeting the requirement of strong connectivity between two nodes. Therefore, if two nodes share a bridged connection, they are both weakly and strongly connected. After social bridges are considered, the numbers of weakly and strongly connected components are termed the enhanced diversity measures via social bridges, denoted as "Weak connectivity+social bridges" and "Strong connectivity+social bridges", respectively. For ease of computation, ego users with more than 30,000 followers (i.e., 88 out of 234,834 ego users) are omitted in the analysis. Figure 5(c) shows the prediction accuracy ($R^2$) of social reputation by diversity measures via social bridges with the change of the Jaccard similarity threshold. Note that each $R^2$ value is obtained via OLS regression with the social reputation index set as the dependent variable and the diversity measure (log-transformed) under each condition set as the sole independent variable. For weak connectivity, a threshold around 0.25 achieves the best performance; for strong connectivity, a threshold around 0.2 achieves the best performance.
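The bridged-connection rule can be sketched for the weak-connectivity case as follows. This is an illustration under stated assumptions rather than the authors' implementation: followee sets are passed in as a dictionary of Python sets, the names jaccard_sim and enhanced_weak_diversity are hypothetical, and the sketch does not decide whether the ego user (which every follower follows) should be excluded from the followee sets, since the text does not say.

```python
from itertools import combinations


def jaccard_sim(f_i, f_j):
    """|F_i ∩ F_j| / |F_i ∪ F_j|, defined as 0 when both sets are empty."""
    union = f_i | f_j
    return len(f_i & f_j) / len(union) if union else 0.0


def enhanced_weak_diversity(followers, edges, followees, threshold):
    """Weakly connected components after adding bridged connections."""
    parent = {u: u for u in followers}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    # Explicit social ties inside the contact neighborhood.
    for u, v in edges:
        if u in parent and v in parent:
            parent[find(u)] = find(v)

    # Implicit ties: similarity strictly larger than the threshold.
    for i, j in combinations(followers, 2):
        if jaccard_sim(followees[i], followees[j]) > threshold:
            parent[find(i)] = find(j)

    return len({find(u) for u in followers})


# Three mutually unconnected followers; a and b share two of four followees.
followers = ["a", "b", "c"]
followees = {"a": {"x", "y", "z"}, "b": {"x", "y", "w"}, "c": {"q"}}
loose = enhanced_weak_diversity(followers, [], followees, threshold=0.25)
strict = enhanced_weak_diversity(followers, [], followees, threshold=0.5)
```

Here JaccardSim(a, b) = 2/4 = 0.5, so a threshold of 0.25 merges a and b into one component, whereas a threshold of 0.5 does not because the comparison is strict. The pairwise loop is quadratic in the number of followers, which is consistent with the computational cost noted for users with many followers.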
In Figure 5(d), we summarise the best prediction accuracy ($R^2$) achieved by each diversity measure via different approaches, including two approaches proposed for undirected networks, k-core decomposition and k-brace decomposition [2]. As above, to facilitate the computation and make these diversity measures comparable, only ego users with no more than 30,000 followers are included in the analysis. We also make some slight changes to make k-core and k-brace decomposition applicable in directed networks (see Methods for details). Specifically, k-core decomposition and k-brace decomposition achieve the best prediction accuracy ($R^2$) when k is set to be 2 and 1, respectively. As we can see from Figure 5(d), social bridges contribute the largest additional gain in predicting personal online social reputation, and the diversity measure via weak connectivity and social bridges yields the best prediction performance. Furthermore, Figure 5(e) shows the regularization path of the $L_1$-regularized linear regression (LASSO regression) model with several diversity measures (log-transformed) set as predictors. As shown in the figure, as the regularization level decreases, diversity measures via social bridges are selected before other diversity measures, which further suggests the effectiveness of social bridges in the prediction of online social reputation.

Robustness Analysis.
In the previous sections, we have shown that individuals with higher levels of structural diversity tend to have higher online social reputations. However, the positive correlation between structural diversity and online social reputation may be biased by other factors. For example, as shown in Figure 6(a), the answer count is also positively correlated with online social reputation (Spearman's $r_s = 0.858$, p < 0.001). In other words, the more answers a user has contributed to the knowledge-sharing community, the higher the social reputation he/she tends to have on the platform. Therefore, personal online social reputation may also be induced by contributed answers rather than merely the structural diversity of individuals. We also find that gender could be another potentially confounding factor, as male users tend to receive higher social reputations than female users and users of unknown gender on the platform (one-way [...]).
[Figure panels; x-axis: number of social components]
Together with several other activity-related factors, such as question count and article count, these factors are controlled for in a series of matching experiments conducted to distinguish the effects of structural diversity from these possible confounding factors in capturing personal social reputation. Propensity score matching (PSM) is a widely used method for matching experiments in the literature [25][56][57][58][59][60][61]. The key intuition of PSM is to match the treatment group with a control group whose members do not receive the treatment but are statistically indistinguishable or at least only marginally different (within a reasonable limit) from the treatment group on all observable covariates. We use the quantified diversity measure via weak connectivity and social bridges to depict the structural diversity of users in the network.
Here, in our setting, a user is said to be treated (treatment group) if his/her diversity measure is larger than or equal to m (m is a given integer), otherwise untreated (control group). We do exact matching on indegree and gender (i.e., matched pairs have equal indegree and the same gender) and propensity score matching on other covariates (see Table S6 of the Supplementary Materials for detailed covariates controlled in the matching experiments). The difference of social reputation index in each matched pair (treatment-control) indicates the relative social reputation induced by the increase of structural diversity.
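The matching design described above can be sketched as follows. Several details are assumptions, since the text does not specify them: propensity scores are taken as already estimated (for example by a logistic regression on the remaining covariates) and stored under a hypothetical key "pscore"; within each exact-match stratum, treated users are paired greedily with the nearest unused control, without replacement and without a caliper. The record layout and function names are illustrative only.

```python
from collections import defaultdict


def match_pairs(users, m):
    """Exact match on (indegree, gender), then nearest propensity score."""
    strata = defaultdict(lambda: ([], []))  # stratum -> (treated, control)
    for u in users:
        treated = u["diversity"] >= m  # treatment rule from the text
        strata[(u["indegree"], u["gender"])][0 if treated else 1].append(u)

    pairs = []
    for treated, control in strata.values():
        available = list(control)
        for t in sorted(treated, key=lambda u: u["pscore"]):
            if not available:
                break  # no unused control left in this stratum
            c = min(available, key=lambda u: abs(u["pscore"] - t["pscore"]))
            available.remove(c)  # matching without replacement (assumption)
            pairs.append((t, c))
    return pairs


def mean_paired_difference(pairs):
    """Average of (treated - control) reputation over matched pairs."""
    diffs = [t["reputation"] - c["reputation"] for t, c in pairs]
    return sum(diffs) / len(diffs)


# Made-up records for illustration.
users = [
    {"id": "A", "indegree": 5, "gender": "M", "diversity": 4, "pscore": 0.60, "reputation": 40.0},
    {"id": "B", "indegree": 5, "gender": "M", "diversity": 2, "pscore": 0.58, "reputation": 30.0},
    {"id": "C", "indegree": 5, "gender": "M", "diversity": 1, "pscore": 0.20, "reputation": 10.0},
    {"id": "D", "indegree": 5, "gender": "F", "diversity": 3, "pscore": 0.50, "reputation": 35.0},
    {"id": "E", "indegree": 7, "gender": "M", "diversity": 2, "pscore": 0.50, "reputation": 20.0},
]
pairs = match_pairs(users, m=3)
effect = mean_paired_difference(pairs)
```

With m = 3, only user A finds a control in the same stratum (user B, the closest propensity score), so the single matched pair gives a difference of 10. Users D and E remain unmatched because their strata contain no counterpart.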
After matching, we obtain a well-balanced dataset with all standardized mean differences between the treated and untreated groups being less than 0.25. Figure 6(c) shows the differences of social reputation index between matched pairs (for space constraint and simplicity, we only present results when m is set to be in the range [2, 10]). As shown in the figure, after ruling out several potentially confounding factors, users with higher levels of structural diversity (assigned as treated users in the matching experiments) would still tend to have higher online social reputations (paired t-test, p < 0.001 for all matching experiments). Taken together, after rigorously controlling possible confounders, matching experiments provide compelling evidence for the role of structural diversity in elevating personal online social reputation.

Discussion
The advent of social networking sites and knowledge-sharing platforms has radically shifted the way we consume information, acquire knowledge, and exchange ideas. As social ties among individuals provide the primary pathways along which interactions occur, the way we are connected and embedded in social networks is thought to affect various personal social outcomes, ranging from personal health to socioeconomic characteristics.
Taking advantage of network data which cover more than 10 million users (including more than 230,000 ego users) from an online knowledge-sharing platform, our study highlights the importance of structural diversity in online social networks and suggests an alternate perspective for people to accumulate their social capital, for policy makers to make appropriate interventions and for market operators to set up effective campaigns. Our findings are also of prime significance to understand why network structure matters in a range of social and economic domains. For example, individuals who are located in a diversified social context are generally accessible to novel information and ideas, which has important implications for viral marketing and fake news research. Moreover, as we live in such a connected world of overloaded information, how we aggregate opinions around us (e.g., adopt information from diverse backgrounds) to arrive at a reliable and accurate estimation is not only crucial to make better decisions but also important to improve the collective intelligence.
Our work is subject to a number of limitations. Our results are the product of one study based on the collected data from an online knowledge-sharing platform. Therefore, additional studies are needed to validate our findings in other kinds of social applications or domains. In this paper, we use snowball sampling during the data collection, but this approach may over- or undersample some data and introduce bias into the results. To quantify the structural diversity of individuals, this work mainly considers the number of connected social components in one's social realm, but other network factors, such as the weight of ties and other structural properties, may help to achieve that goal and are also worth exploring in future work. Although social bridges help to capture the implicit structural diversity of individuals, the computation requires a large amount of computing time and memory, especially for users with a large number of followers. In our study, the matching experiments have already accounted for several observed characteristics of users, but due to the limitation of data availability, the estimation results may still be biased because unobserved or unmeasured confounding factors are not controlled. We emphasize that our results are built upon correlation analysis of observational data and are therefore not sufficient to imply causality. The current study mainly considers social ties induced by following relationships, but social connections induced by behavioral changes, such as comment or retweet, may provide another way to infer the interrelationships

Construction of Social Reputation Index.
The current study adopts nonnegative matrix factorization (NMF) to construct the social reputation index from the original data. NMF is an unsupervised approach for dimensionality reduction with the constraints that the input and output matrices do not contain any negative elements [44][45][46][47]. Specifically, for a given nonnegative data matrix $V$, NMF attempts to find an approximate factorization $V \approx W \cdot H$ such that the original matrix $V$ can be decomposed into two nonnegative matrices $W$ and $H$, with the goal of minimizing the reconstruction error between $V$ and $W \cdot H$. In our setting, the decomposition of an original matrix with n users and m dimensions of features can be written as $V_{n \times m} \approx W_{n \times r} \cdot H_{r \times m}$, where r is a parameter given prior to the matrix decomposition and indicates the expected dimension after NMF. Here in our study, $V_{n \times m}$ is the original data matrix, where n equals 234,834 (i.e., the number of ego users in the sample) and m equals 3 as there are three types of popularity measures (i.e., number of received upvotes, thanks, and favorites), whereas r is set to be 1 as we aim to obtain a single measure of social reputation. After NMF, the reduced matrix $W_{n \times r}$ is denoted as the social reputation index.
In practice, these three popularity measures are first log-transformed by $\log_{10}(x + 1)$ and then fed into NMF simultaneously. Once the social reputation index (the first matrix $W$ after NMF decomposition) is obtained, we normalize it to the range [0, 100] as follows: given a vector $W$ indicating the social reputation index, every element of $W$ is normalized as $W_i = 100 \times (W_i - W_{\min}) / (W_{\max} - W_{\min})$, where $W_{\max}$ and $W_{\min}$ are the maximum and minimum values of $W$, respectively.
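The pipeline above (log transform, rank-1 NMF, min-max scaling to [0, 100]) can be sketched in plain Python. The text does not state which NMF solver or initialization was used; the sketch assumes the squared reconstruction error and alternating updates, which for r = 1 coincide with the standard multiplicative update rules. The function names and the example counts are hypothetical.

```python
import math


def rank1_nmf(V, iters=200):
    """Approximate V (n x m, nonnegative) by the outer product of W and H."""
    n, m = len(V), len(V[0])
    W, H = [1.0] * n, [1.0] * m
    eps = 1e-12  # guards against division by zero
    for _ in range(iters):
        ww = sum(w * w for w in W) + eps
        H = [sum(W[i] * V[i][j] for i in range(n)) / ww for j in range(m)]
        hh = sum(h * h for h in H) + eps
        W = [sum(V[i][j] * H[j] for j in range(m)) / hh for i in range(n)]
    return W, H


def reputation_index(counts):
    """counts: one row per user with (upvotes, thanks, favorites)."""
    V = [[math.log10(x + 1) for x in row] for row in counts]
    W, _ = rank1_nmf(V)
    w_min, w_max = min(W), max(W)
    if w_max == w_min:
        return [0.0] * len(W)  # degenerate case: all users identical
    return [100 * ((w - w_min) / (w_max - w_min)) for w in W]


# Made-up popularity counts for three users.
counts = [[0, 0, 0], [10, 5, 3], [1000, 400, 250]]
index = reputation_index(counts)
```

In this toy input the user with no received upvotes, thanks, or favorites maps to 0, the most popular user maps to 100, and the remaining user falls strictly in between. Because all entries stay nonnegative under these updates, no explicit clipping is needed.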

LASSO Regression.
Generally speaking, OLS regression yields nonzero estimates for the coefficients of all the features. LASSO, by contrast, adds the $L_1$ penalty of the coefficients (i.e., the sum of the absolute values of the estimated coefficients) as the regularization term to the loss function of OLS regression (see Refs. [49][50][51] for further technical details). Therefore, when the regularization level is set to be zero, LASSO regression is equivalent to OLS regression. As the regularization level increases, LASSO gradually forces the coefficients of less important features to zero. When the regularization level becomes sufficiently large, all the estimated coefficients will be zero.
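The behaviour described above can be reproduced with a small coordinate-descent sketch. It assumes one common scaling of the objective, (1/(2n))·||y − Xβ||² + λ·||β||₁, with centered data and no intercept; the text does not state which scaling or solver was used, and the names soft_threshold and lasso_cd are hypothetical.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_cd(X, y, lam, iters=100):
    """Cyclic coordinate descent for the LASSO (columns of X must be nonzero)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z
    return beta


# Made-up orthogonal design with y = 3 * x1 + 0.5 * x2.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [3.5, -2.5, 2.5, -3.5]

beta_ols = lasso_cd(X, y, lam=0.0)   # no penalty: recovers the OLS solution
beta_mid = lasso_cd(X, y, lam=1.0)   # weaker feature already shrunk to zero
beta_big = lasso_cd(X, y, lam=3.0)   # penalty large enough to zero everything
```

Reading the three fits from the largest penalty to the smallest gives a coarse regularization path: the first feature enters before the second, which is the sense in which a measure is "selected first" as the regularization level decreases.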

k-Core and k-Brace Decomposition.
k-Core decomposition and k-brace decomposition have been shown useful in depicting structural diversity in undirected networks [2]. Considering the fact that users with completely isolated followers tend to have higher social reputations than their counterparts, node removal in k-core and k-brace decomposition would not work here. In this regard, we make some slight changes to make k-core and k-brace decomposition applicable in the current scenario. In the contact neighborhood formed by the followers of an ego user, only edges that are affiliated with specific nodes are removed (instead of node removal). For k-core decomposition, we use the degree (i.e., indegree+outdegree) of nodes to do the decomposition process: edges that are affiliated with nodes whose degrees are less than k are repeatedly removed. For k-brace decomposition, we transform the directed networks to undirected ones and then implement the decomposition process: edges whose two endpoints share fewer than k neighbors are repeatedly removed. After decomposition, the number of (weakly) connected components in the decomposed contact neighborhood is named the diversity measure via k-core or k-brace decomposition.
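The edge-removal variant of k-core decomposition described above can be sketched as follows. This is one reading of the text, not the authors' code: degrees are recomputed from the surviving edges on every pass, self-loops are dropped, all followers are kept as nodes, and the function name kcore_diversity and the edge representation (u, v) for "u follows v" are hypothetical. The k-brace variant would follow the same loop with a shared-neighbor test on the undirected graph.

```python
def kcore_diversity(followers, edges, k):
    """Components left after repeatedly deleting edges at nodes of degree < k."""
    nodes = set(followers)
    live = {(u, v) for u, v in edges if u in nodes and v in nodes and u != v}

    while True:
        # Degree = indegree + outdegree over the surviving edges.
        degree = dict.fromkeys(nodes, 0)
        for u, v in live:
            degree[u] += 1
            degree[v] += 1
        low = {u for u in nodes if degree[u] < k}
        kept = {(u, v) for u, v in live if u not in low and v not in low}
        if kept == live:
            break  # nothing more to remove
        live = kept

    # Count weakly connected components; nodes are never removed.
    parent = {u: u for u in nodes}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for u, v in live:
        parent[find(u)] = find(v)
    return len({find(u) for u in nodes})


# Toy neighborhood: a directed triangle a-b-c, a pendant d, and an isolate e.
followers = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
by_k = {k: kcore_diversity(followers, edges, k) for k in (1, 2, 3)}
```

For k = 1 nothing is removed and the count is 2; for k = 2 the pendant edge to d is deleted and the count becomes 3; for k = 3 every edge is eventually deleted and all five followers count as separate components.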

Data Availability
The relevant data that support the findings of this paper are available from the corresponding authors upon reasonable request.

Conflicts of Interest
The authors declare no competing interest.

[Figure 6, partial caption] Gender differences in terms of social reputation index, where NA refers to users whose gender is not disclosed. (c) Matching results where the treatment assignment is set as whether the quantified structural diversity measure is larger than or equal to m (m ∈ [2, 10]). Error bars are 95% confidence intervals.

Supplementary Materials
Figure S3: illustration of the function of social bridges in an example ego network.
Figure S4: examples of social bridges in ego networks.
Table S1: data descriptions of three popularity measures.
Table S2: correlation coefficients between three popularity measures.
Table S3: data descriptions of other activity-related factors.
Table S4: data descriptions of indegree, weak diversity measure, and strong diversity measure.
Table S5: correlation coefficients between indegree, weak diversity measure, and strong diversity measure.
Table S6: covariates accounted for in propensity score matching.