A Trio Neural Model for Dynamic Entity Relatedness Ranking

Tu Nguyen
L3S Research Center

Tuan Tran
Robert Bosch GmbH

Wolfgang Nejdl
L3S Research Center
Abstract

Measuring entity relatedness is a fundamental task for many natural language processing and information retrieval applications. Prior work often studies entity relatedness in a static setting and in an unsupervised manner. However, real-world entities are often involved in many different relationships, so entity relations are highly dynamic over time. In this work, we propose a neural network-based approach for dynamic entity relatedness that leverages collective attention as supervision. Our model is capable of learning rich and different entity representations in a joint framework. Through extensive experiments on large-scale datasets, we demonstrate that our method achieves better results than competitive baselines.

1 Introduction

Measuring semantic relatedness between entities is an inherent component of many text mining applications. In search and recommendation, the ability to suggest the most related entities for an entity-bearing query has become a standard feature of popular Web search engines Blanco et al. (2013). In natural language processing, entity relatedness is an important factor for various tasks, such as entity linking Hoffart et al. (2012) or word sense disambiguation Moro et al. (2014).

However, prior work on semantic relatedness often neglects the time dimension and considers entities and their relationships as static. In practice, many entities are highly ephemeral Jiang et al. (2016), and users seeking information related to those entities would like to see fresh information. For example, users looking up the entity Taylor Lautner during 2008–2012 might want to be recommended entities such as The Twilight Saga, due to Lautner’s well-known performance in the film series; however, the same query in August 2016 should be served with entities related to his appearances in more recent films such as “Scream Queens” and “Run the Tide”. In addition, much previous work derives semantic relatedness from co-occurrence-based computations or heuristic functions without directly optimizing for the final goal. We believe that a desirable framework should treat entity semantic relatedness not as a separate step but as an integral part of the process, for instance by learning it in a supervised manner.

In this work, we address the problem of entity relatedness ranking, that is, designing semantic relatedness models that are optimized for ranking systems such as top-$k$ entity retrieval or recommendation. In this setting, the goal is not to quantify the semantic relatedness between two entities based on their occurrences in the data, but to optimize the partial order of the related entities in the top positions. This problem differs from traditional entity ranking Kang et al. (2015): there, rankings are driven by user queries and optimized for their (ad-hoc) information needs, whereas entity relatedness ranking also aims to uncover the meaning of the relatedness from the data. In other words, while conventional entity semantic relatedness learns from data (the perspective of editors or content providers) and entity ranking learns from the user’s perspective, entity relatedness ranking strikes a trade-off between these views. Such a hybrid approach can benefit applications such as exploratory entity search Miliaraki et al. (2015), where users have a specific goal in mind but at the same time are open to other related entities.

We also tackle the issue of dynamic ranking and design a supervised learning model that takes into account the temporal contexts of entities and leverages collective attention from public sources. As an illustration, when one looks into the Wikipedia page of Taylor Lautner, each navigation to another Wikipedia page indicates the user's interest in the corresponding target entity given her initial interest in Lautner. Collectively, the navigation traffic observed over time is a good proxy for the shift of public attention to the entity (Figure 1).

In addition, while previous work mainly focuses on one aspect of the entities, such as textual profiles or linking graphs, we propose a trio neural model that learns low-level representations of entities from three different aspects: content, structure, and time. For the time aspect, we propose a convolutional model to embed and attend to local patterns of the past temporal signals in the Euclidean space. Experiments show that our trio model outperforms traditional approaches in ranking correlation and recommendation tasks. Our contributions are summarized as follows.

  • We formulate dynamic entity relatedness ranking and optimize directly for time-sensitive pairwise ordering rather than static similarity scores.

  • We introduce a temporal convolutional module with a monotonic time-decay weighting scheme, enabling the model to embed temporal signals and emphasize recency-dependent local patterns.

  • We propose a trio neural ranking framework that jointly models content-, graph-, and time-based views of entities, yielding mutually informative representations tailored for ranking.

Figure 1: The dynamics of collective attention for related entities of Taylor Lautner in 2016.

2 Related Work

2.1 Entity Relatedness and Recommendation

Most existing semantic relatedness measures (e.g., those derived from Wikipedia) fall into two major types: (1) text-based and (2) graph-based. For the first type, traditional methods mainly focus on a high-dimensional semantic space based on occurrences of words (Gabrilovich and Markovitch (2007); Gabrilovich and Markovitch (2009)) or concepts (Aggarwal and Buitelaar (2014)). In recent years, embedding methods that learn low-dimensional word representations have been proposed. Hu et al. (2015) leverage entity embeddings on knowledge graphs to better learn the distributional semantics. Ni et al. (2016) use an adapted version of Word2Vec, where each entity in a Wikipedia page is considered as a term. Graph-based measures usually take advantage of the hyperlink structure of the entity graph Witten and Milne (2008); Guo and Barbosa (2014). Recent graph embedding techniques (e.g., DeepWalk Perozzi et al. (2014)) have not been directly used for entity relatedness in Wikipedia, yet their performance has been studied and shown to be very competitive in recent related work Zhao et al. (2015); Ponza et al. (2017).

Entity relatedness is also studied in connection with the entity recommendation task. The Spark system Blanco et al. (2013) first introduced the task for Web search, and Yu et al. (2014); Zhang et al. (2016a) exploit user click logs and entity pane logs for global and personalized entity recommendation. However, these approaches are optimized for user information needs and do not target the global and temporal dimensions. Recently, Zhang et al. (2016b); Tran et al. (2017) proposed time-aware probabilistic approaches that combine ‘static’ entity relatedness with temporal factors from different sources. Nguyen et al. (2018) studied the task of time-aware ranking for entity aspects and proposed an ensemble model to address the problem of competing sub-features.

2.2 Neural Network Models

Neural Ranking. Deep neural ranking models in IR and NLP can generally be divided into two groups: representation-focused and interaction-focused models. The representation-focused approach Huang et al. (2013) independently learns a representation for each ranking element (e.g., query and document) and then employs a similarity function. The interaction-focused models, on the other hand, take the early interactions between the ranking pairs as the input of the network. For instance, Lu and Li (2013); Guo et al. (2016) build interactions (i.e., local matching signals) between two pieces of text and train a feed-forward network to compute the matching score. This enables the model to capture various interactions between ranking elements, whereas with the former approach the model only observes the input elements in isolation.

Attention networks. In recent years, attention-based NN architectures, which learn to focus their “attention” on specific parts of the input, have shown promising results on various NLP tasks. In most cases, attention is applied to sequential models to capture global context Luong et al. (2015). An attention mechanism often relies on a context vector that facilitates outputting a “summary” over all (deterministic soft) or a sample (stochastic hard) of the input states. Recent work proposed a CNN with an attention-based framework to model local context representations of textual pairs Yin et al. (2016), or combined it with an LSTM to model time-series data Ordóñez and Roggen (2016); Lin et al. (2017) for classification and trend prediction tasks.

3 Problem

3.1 Preliminaries

We denote as named entities any real-world objects registered in a database. Each entity has a textual document (e.g., the content of a home page) and a sequence of references to other entities (e.g., obtained from semantic annotations), called the entity link profile. All link profiles constitute an entity linking graph. In addition, two types of information are included to form the entity collective attention.

Temporal signals. Each entity can be associated with a number of properties such as view counts, content edits, etc. Given an entity $e$, a time point $n$, and $D$ properties, the temporal signals form a (univariate or multivariate) time series $X\in\mathbb{R}^{D\times T}$ consisting of $T$ real-valued vectors $x_{n-T},\cdots,x_{n-1}$, where $x_{t}\in\mathbb{R}^{D}$ captures the past signals of $e$ at time point $t$.

Entity Navigation. In many systems, the user navigation between two entities is captured, e.g., search engines can log the total click-through of documents of the target entity presented in search results of a query involving the source entity. Following learning to rank approaches Kang et al. (2015), we use this information as the ground truth in our supervised models. Given two entities $e_{1},e_{2}$, the navigation signal from $e_{1}$ to $e_{2}$ at time point $t$ is denoted by $y_{\{e_{1},e_{2}\}}^{t}$.

3.2 Problem Definition

We consider the task of quantifying and ranking the semantic relatedness between entities as it evolves over time. Rather than assuming a single static relatedness function, we allow the notion of relatedness to vary with time and model it as a family of functions $F=\{f_{t}\}_{t\in\mathcal{T}}$, where each $f_{t}$ reflects the relations observed at time $t$.

Dynamic Entity Relatedness.

For any time point $t$ and any pair of entities $(e_{s},e_{t})$ consisting of a source entity $e_{s}$ and a target entity $e_{t}$, the dynamic relatedness score is given by a function

$$f_{t}:\mathcal{E}\times\mathcal{E}\rightarrow\mathbb{R}_{\geq 0}.$$

We require each $f_{t}$ to satisfy the following properties:

  • Asymmetry: $f_{t}(e_{i},e_{j})$ need not equal $f_{t}(e_{j},e_{i})$, reflecting directional relationships.

  • Non-negativity: $f_{t}(e_{i},e_{j})\geq 0$ for all entity pairs.

  • Indiscernibility: $e_{i}=e_{j}$ implies $f_{t}(e_{i},e_{j})=1$, corresponding to maximal self-relatedness.

We make no assumptions of symmetry, transitivity, or metric structure, allowing $f_{t}$ to capture complex and time-varying semantic interactions.

Dynamic Entity Relatedness Ranking.

Given a source entity $e_{s}$ and a time point $t$, the goal is to rank a set of candidate target entities $\{e_{1},\ldots,e_{n}\}$ according to their dynamic relatedness scores $f_{t}(e_{s},e_{1}),\ldots,f_{t}(e_{s},e_{n})$. The output is an ordering that reflects the relative strength of entity-entity associations at time $t$.
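The ranking task defined above can be illustrated with a minimal sketch. The score table `TOY_SCORES` and the helper `toy_score` are hypothetical stand-ins for a learned $f_{t}$; the values are invented for illustration and are not taken from the paper's data.

```python
from typing import Callable, List

def rank_candidates(score: Callable[[str, str, str], float],
                    source: str, candidates: List[str], month: str) -> List[str]:
    """Order candidates by descending relatedness to `source` at time `month`."""
    return sorted(candidates, key=lambda c: score(source, c, month), reverse=True)

# Hypothetical toy scores standing in for a learned f_t (illustrative values only).
TOY_SCORES = {
    ("2012-08", "Taylor Lautner", "The Twilight Saga"): 0.9,
    ("2012-08", "Taylor Lautner", "Scream Queens"): 0.1,
    ("2016-08", "Taylor Lautner", "The Twilight Saga"): 0.4,
    ("2016-08", "Taylor Lautner", "Scream Queens"): 0.7,
}

def toy_score(source: str, target: str, month: str) -> float:
    """Look up f_t(source, target); unseen pairs get a score of 0."""
    return TOY_SCORES.get((month, source, target), 0.0)

candidates = ["The Twilight Saga", "Scream Queens"]
ranking_2012 = rank_candidates(toy_score, "Taylor Lautner", candidates, "2012-08")
ranking_2016 = rank_candidates(toy_score, "Taylor Lautner", candidates, "2016-08")
```

The same source entity yields a different ordering at the two time points, which is the behavior a static relatedness function cannot express.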

4 Approach Overview

4.1 Datasets and Their Dynamics

In this work we use Wikipedia data as the case study for our entity relatedness ranking problem due to its rich knowledge and dynamic nature. It is worth noting that, although we experiment on Wikipedia, our framework is universal and can be applied to other entity sources with available temporal signals and entity navigation. We use Wikipedia pages to represent entities and page views as the temporal signals (details in Section 6.1).

Clickstream. For entity navigation, we use the clickstream dataset generated from the Wikipedia webserver logs from February until September, 2016. These datasets contain an accumulation of transitions between two Wikipedia articles with their respective counts on a monthly basis. We study only actual pages (e.g. excluding disambiguation or redirects). In the following, we provide the first analysis of the clickstream data to gain insights into the temporal dynamics of the entity collective attention in Wikipedia.

Figure 2: Click (navigation) times distribution and ranking correlation of entities in September 2016. Panels: (a) click times distribution; (b) correlation of top-k entities; (c) correlation by number of navigations.
Table 1: Statistics on the dynamics of the clickstream; $e_{s}$ denotes source entities, $e_{t}$ related entities.

| Compared month | % new $e_{s}$ | % with new $e_{t}$ | % with new $e_{t}$ in top-30 | # new $e_{t}$ (avg.) |
|---|---|---|---|---|
| 08-2016 | 24.31 | 71.18 | 15.54 | 18.25 |
| 04-2016 | 30.61 | 66.72 | 53.44 | 42.20 |

Figure 2(a) illustrates the distribution of entities by click frequencies, and Figure 2(b) shows the correlation of the most popular entities (measured by total navigations) across different months. In general, we observe that user navigation activities for the most popular entities are very dynamic and change substantially over time. Figure 2(c) visualizes the dynamics of related entities in different ranking sections (e.g., from rank 0 to rank 20) across months, in terms of their correlation scores. Among the top-100 related entities, those that stay in the top-20 tend to be more correlated across months than those in the bottom-20.

As shown in Table 1, 24.31% of the entities among the top-10,000 most active entities of September 2016 do not appear in the same list for the previous month, and 30.61% are new compared with 5 months before. In addition, 71% of the entities in the top-10,000 have navigations to new entities compared to the previous month, with approximately 18 new entities navigated to on average. Thus, the datasets are naturally very dynamic and sensitive to change. The substantial amount of missing past click logs for the newly formed relationships also calls for a dynamic measuring approach.
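Statistics of the kind reported in Table 1 could be computed along the following lines. This is only a sketch under assumptions: the monthly clickstream is taken to be a mapping from (source, target) pairs to navigation counts, "most active" is interpreted as the largest total outgoing navigations, and the toy data are invented; the paper does not spell out its exact procedure.

```python
from typing import Dict, Set, Tuple

Clicks = Dict[Tuple[str, str], int]  # (source, target) -> monthly navigation count

def top_sources(clicks: Clicks, k: int) -> Set[str]:
    """Return the k source entities with the most outgoing navigations."""
    totals: Dict[str, int] = {}
    for (src, _tgt), count in clicks.items():
        totals[src] = totals.get(src, 0) + count
    ranked = sorted(totals, key=lambda s: (-totals[s], s))  # ties broken by name
    return set(ranked[:k])

def pct_new_sources(curr: Clicks, prev: Clicks, k: int) -> float:
    """Share of current top-k sources absent from the earlier top-k list."""
    now, before = top_sources(curr, k), top_sources(prev, k)
    return 100.0 * len(now - before) / len(now)  # assumes a non-empty top-k list

def pct_with_new_targets(curr: Clicks, prev: Clicks, k: int) -> float:
    """Share of current top-k sources with at least one unseen (source, target) pair."""
    now = top_sources(curr, k)
    with_new = {s for (s, t) in curr if s in now and (s, t) not in prev}
    return 100.0 * len(with_new) / len(now)

# Invented toy months (not real clickstream data).
prev_month = {("A", "x"): 50, ("B", "y"): 40, ("C", "z"): 5}
curr_month = {("A", "x"): 30, ("A", "w"): 20, ("D", "y"): 45, ("B", "y"): 3}

new_sources = pct_new_sources(curr_month, prev_month, k=2)
new_targets = pct_with_new_targets(curr_month, prev_month, k=2)
```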

Figure 3 shows the overall architecture of our framework, which consists of three major components: time-, graph- and content-based networks. Each component can be considered as a separate sub-ranking network. Each network accepts a tuple of three elements/representations as input in a pair-wise fashion, i.e., the source entity $e_{s}$, the target entity $e_{t}$ with higher rank (denoted as $e_{(+)}$) and the one with lower rank (denoted as $e_{(-)}$). For the content network, each element is a sequence of terms coming from the entity's textual representation. For the graph network, we learn the embeddings from the entity linking graph. For the time network, we propose a new convolutional model that learns from the entity temporal signals. More details are given below.

Figure 3: The trio neural model for entity ranking.

4.2 Neural Ranking Model Overview

Entity relatedness can in principle be modeled using a point-wise approach that directly predicts a scalar score for each entity pair. However, navigation data is highly skewed, and supervision from long-tail interactions is often noisy. Instead of learning a fully calibrated scoring function, we therefore adopt a pair-wise ranking strategy, which focuses on comparing candidate entities relative to one another. Pair-wise methods have the advantage of preserving partial orders of the underlying relatedness functions $f_{t}$ even when $f_{t}$ is not globally transitive Cheng et al. (2012), making them well-suited for dynamic entity ranking.

Our architecture follows an interaction-based design in which the model directly processes the triplet $(e_{s},e_{(+)},e_{(-)})$ to learn their relative compatibility. In contrast to Siamese or representation-based models Chopra et al. (2005), our networks do not share parameters across branches. Parameter sharing would implicitly enforce symmetry $f_{t}(e_{s},e)=f_{t}(e,e_{s})$, contradicting the inherently asymmetric nature of the relatedness function (Section 3.2).

Each branch of the model is a feed-forward network $\psi$ with input $z_{0}$, hidden layers $z_{1},\ldots,z_{n-1}$, and output $z_{n}$, where

$$z_{i}=\sigma(\mathbf{W}_{i}z_{i-1}+\mathbf{b}_{i}),$$

and $\sigma$ is a nonlinear activation such as ReLU. Under the trio setup, the overall pair-wise score is the sum of the outputs from the temporal, graph, and content networks:

$$\phi(e_{s},e_{(+)},e_{(-)})=\phi_{\mathrm{time}}+\phi_{\mathrm{graph}}+\phi_{\mathrm{content}}.$$

The next section details the input representations $z_{0}$ used by each network.
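The branch recursion and the summed trio score described in this section can be sketched in plain Python as follows. The layer shapes, weights, and the two scalar outputs standing in for the graph and content branches are hypothetical toy values; a real implementation would use a tensor library and learned parameters.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (rows of W_i, bias b_i)

def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]

def affine(W: List[Vector], b: Vector, z: Vector) -> Vector:
    """Compute W z + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, z)) + bias for row, bias in zip(W, b)]

def branch(layers: Sequence[Layer], z0: Vector) -> Vector:
    """Apply z_i = ReLU(W_i z_{i-1} + b_i) for every layer in turn."""
    z = z0
    for W, b in layers:
        z = relu(affine(W, b, z))
    return z

def trio_score(branch_outputs: Sequence[float]) -> float:
    """Overall pair-wise score: sum of the time, graph and content outputs."""
    return sum(branch_outputs)

# Toy two-layer branch (hypothetical weights): 2 inputs -> 2 hidden -> 1 output.
toy_layers: List[Layer] = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [-0.5]),
]
phi_time = branch(toy_layers, [2.0, 1.0])[0]
phi = trio_score([phi_time, 0.5, 1.0])  # 0.5 and 1.0 stand in for the other branches
```

Because each branch has its own `layers`, no parameters are shared between the three sub-networks, in line with the design choice stated above.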

5 Entity Relatedness Ranking

5.1 Content-based representation learning

To obtain content representations, we leverage both the textual document (word-based) and the link profile (entity-based) of each entity, as introduced in Section 3.1. Given the large word and entity vocabularies, we employ the word hashing technique of Huang et al. (2013), which maps each token to a set of character trigrams. Let $\mathcal{V}$ denote the trigram vocabulary and $\mathsf{E}:\mathcal{V}\rightarrow\mathbb{R}^{m}$ the embedding function. We also learn a global importance weight $\mathsf{w}:\mathcal{V}\rightarrow\mathbb{R}_{\geq 0}$.

For an entity $e$, let $e_{w}=(v_{1},\ldots,v_{n_{e}})$ denote the sequence of hashed trigrams extracted from its document. We represent $e$ by a weighted compositional embedding:

$$\phi_{\mathrm{word}}(e)=\sum_{j=1}^{n_{e}}\mathsf{w}(v_{j})\,\mathsf{E}(v_{j})\;\in\mathbb{R}^{m}.$$

This formulation corresponds to a linear bag-of-subword encoder and equals, up to normalization by the total weight $\sum_{j}\mathsf{w}(v_{j})$, the expected embedding of a trigram drawn with probability proportional to its weight.

For the entity-based representation, we analogously define $e_{\mathrm{ent}}=(u_{1},\ldots,u_{k_{e}})$ as the hashed tokens obtained from the surface forms of entities linked from $e$, and compute:

$$\phi_{\mathrm{ent}}(e)=\sum_{j=1}^{k_{e}}\mathsf{w}(u_{j})\,\mathsf{E}(u_{j}).$$

The final content representation is their concatenation:

$$\phi_{\mathrm{content}}(e)=[\phi_{\mathrm{word}}(e);\phi_{\mathrm{ent}}(e)]\in\mathbb{R}^{2m}.$$

For a triplet $(e_{s},e_{(+)},e_{(-)})$, the concatenated content representations serve as the input to the content sub-network.
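The content representation above can be sketched as follows. The sketch assumes letter-trigram hashing with `#` as the word-boundary marker (as in Huang et al. (2013)), skips trigrams missing from the embedding table, and uses a default weight of 1.0 for trigrams without a learned weight; the embedding table and weights are invented toy values, and the last two choices are illustrative assumptions rather than details given in the text.

```python
from typing import Dict, List

def letter_trigrams(token: str) -> List[str]:
    """Hash a token into character trigrams, with '#' marking the boundaries."""
    padded = f"#{token.lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def weighted_bag(tokens: List[str], emb: Dict[str, List[float]],
                 weight: Dict[str, float], dim: int) -> List[float]:
    """Sum of w(v) * E(v) over all trigrams v of all tokens."""
    out = [0.0] * dim
    for token in tokens:
        for tri in letter_trigrams(token):
            if tri not in emb:          # assumption: unknown trigrams are skipped
                continue
            w = weight.get(tri, 1.0)    # assumption: default weight of 1.0
            for d in range(dim):
                out[d] += w * emb[tri][d]
    return out

def content_representation(words: List[str], linked_entities: List[str],
                           emb: Dict[str, List[float]],
                           weight: Dict[str, float], dim: int) -> List[float]:
    """Concatenate the word-based and entity-based embeddings (length 2 * dim)."""
    return (weighted_bag(words, emb, weight, dim)
            + weighted_bag(linked_entities, emb, weight, dim))

# Toy embedding table and weights (hypothetical values, m = 2).
toy_emb = {"#ca": [1.0, 0.0], "cat": [0.0, 1.0], "at#": [1.0, 1.0]}
toy_weight = {"#ca": 2.0, "cat": 1.0, "at#": 0.5}

phi_word = weighted_bag(["cat"], toy_emb, toy_weight, dim=2)
phi_content = content_representation(["cat"], ["cat"], toy_emb, toy_weight, dim=2)
```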

5.2 Graph-based representation

For graph-based representations, we follow the DeepWalk framework Perozzi et al. (2014). Let $G=(V,E)$ be the entity graph, where $V$ is the set of entities and edges correspond to hyperlinks. DeepWalk generates random walks $\mathbb{S}_{e}=(v_{1},\ldots,v_{L})$ from each entity $e$ and optimizes a Skip-gram objective:

$$\max_{\mathsf{G}}\sum_{e\in V}\sum_{v_{j}\in\mathbb{S}_{e}}\sum_{v_{k}\in\mathcal{N}(v_{j})}\log\mathbb{P}\big(v_{k}\mid\mathsf{G}(v_{j})\big),$$

where $\mathsf{G}:V\rightarrow\mathbb{R}^{d}$ is the learned graph embedding and $\mathcal{N}(v_{j})$ is the context window. This yields graph-aware embeddings capturing co-occurrence structure under random walks.

Given two entities $e_{s}$ and $e_{t}$, let

$$\mathbb{C}_{e_{s}}=\{\mathsf{G}(v):v\in\mathbb{S}_{e_{s}}\},\qquad\mathbb{C}_{e_{t}}=\{\mathsf{G}(u):u\in\mathbb{S}_{e_{t}}\}$$

denote their graph-context embeddings. For every pair $(x,y)\in\mathbb{C}_{e_{s}}\times\mathbb{C}_{e_{t}}$, we compute the cosine similarity:

$$\mathrm{sim}(x,y)=\frac{\langle x,y\rangle}{\lVert x\rVert\,\lVert y\rVert}.$$

Following Guo et al. (2016), we discretize these similarities into $B$ fixed histogram bins. Let $H_{b}$ be the count of similarities falling into bin $b$. The interaction vector is:

$$\phi_{\mathrm{graph}}(e_{s},e_{t})=\big[\log(1+H_{1}),\;\ldots,\;\log(1+H_{B})\big]\in\mathbb{R}^{B}.$$

This histogram encodes soft matching of graph neighborhoods, analogous to classical link-based relatedness measures Witten and Milne (2008), but operating at the embedding level.
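The histogram interaction vector can be sketched as below. The sketch assumes $B$ equal-width bins over $[-1,1]$ with a similarity of exactly 1 falling into the last bin; the binning scheme and the toy embeddings are illustrative assumptions, since the text only states that the bins are fixed.

```python
import math
from typing import List

def cosine(x: List[float], y: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0

def matching_histogram(ctx_s: List[List[float]], ctx_t: List[List[float]],
                       num_bins: int) -> List[float]:
    """Log-count histogram of pairwise cosine similarities between two contexts."""
    counts = [0] * num_bins
    for x in ctx_s:
        for y in ctx_t:
            sim = cosine(x, y)
            # Map [-1, 1] onto bin indices 0..num_bins-1 (assumed equal-width bins).
            idx = int((sim + 1.0) / 2.0 * num_bins)
            idx = min(max(idx, 0), num_bins - 1)  # clamp sim = 1 and rounding noise
            counts[idx] += 1
    return [math.log(1 + h) for h in counts]

# Toy context embeddings (hypothetical): similarities are 1, 0 and -1.
ctx_source = [[1.0, 0.0]]
ctx_target = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
phi_graph = matching_histogram(ctx_source, ctx_target, num_bins=4)
```

The vector length depends only on the number of bins, so entity pairs with random-walk contexts of different sizes map to inputs of the same dimension.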

5.3 Temporal Representation via Time-Weighted Convolution

To learn representations from temporal signals, we treat each entity’s multivariate time series as a sequence of $T$ aligned observations. Our goal is to embed such sequences into a Euclidean space in which temporally similar entities lie close together. We employ a 1-D convolutional architecture that extracts local temporal patterns and then reweights these patterns according to their temporal proximity to the prediction time. The architecture consists of: (i) a temporal convolution layer, (ii) batch normalization, (iii) a time-decay weighting mechanism, and (iv) a fully connected projection layer.

Temporal convolution.

Let $X=(x_{1},\ldots,x_{T})$ denote an entity’s time series, where $x_{t}\in\mathbb{R}^{D}$ is a $D$-dimensional feature vector at time $t$ (e.g., a scalar popularity signal or a small feature bundle). A 1-D convolution with kernel width $w$ applies a filter $\mathbf{W}\in\mathbb{R}^{w\times D}$ and bias $b\in\mathbb{R}$ to each contiguous window $(x_{t-w+1},\ldots,x_{t})$:

$$q_{t}=\langle\mathbf{W},X_{t-w+1:t}\rangle+b,\qquad t=w,\ldots,T.$$

Batch normalization and a ReLU activation yield

$$h_{t}=\mathrm{ReLU}\left(\mathrm{BN}(q_{t})\right),\qquad t=w,\ldots,T.$$

The resulting sequence $\mathbf{h}=(h_{w},h_{w+1},\ldots,h_{T})$ captures local temporal patterns (bursts, surges, and short-term trends).

Time-decay weighting.

Not all temporal patterns should contribute equally: patterns closer to the prediction time are intuitively more relevant. We therefore introduce a deterministic, strictly positive decay function $A_{t}$ that assigns higher weight to features extracted from recent time steps.

For the convolution output index $t$, denote its temporal distance from the prediction point by $\Delta_{t}=T-t$. We define:

$$A_{t}=\frac{1}{(1+\Delta_{t})^{\alpha}},\qquad\alpha>0,\quad t=w,\ldots,T.$$

The weighted convolutional representation is then:

$$\tilde{h}_{t}=A_{t}\cdot h_{t},\qquad t=w,\ldots,T.$$

This mechanism is a deterministic “soft focusing” operation: unlike learned attention, the weights are fixed by temporal proximity, providing an interpretable, monotonic recency bias.

Projection.

Finally, the weighted sequence is flattened and passed through a nonlinear fully connected layer:

$$z=\sigma\left(\mathbf{W}_{\mathrm{fc}}\,[\tilde{h}_{w};\tilde{h}_{w+1};\ldots;\tilde{h}_{T}]+\mathbf{b}_{\mathrm{fc}}\right),$$

which yields the final temporal embedding $z\in\mathbb{R}^{d}$. Only the last convolutional layer is time-weighted, though deeper stacks of convolutional layers can be used.

This design cleanly separates (i) local temporal pattern extraction via convolution, and (ii) global temporal relevance modulation via a principled, monotonic decay, yielding an interpretable and robust temporal representation.
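The convolution and time-decay steps can be sketched in plain Python for a single filter. This is a simplified illustration under assumptions: batch normalization is replaced by a fixed mean and variance passed in as arguments (hypothetical inference-time statistics), the final projection layer is omitted, and the series and filter are toy values.

```python
import math
from typing import List

def conv1d(X: List[List[float]], W: List[List[float]], b: float) -> List[float]:
    """Single-filter 1-D convolution; X is T x D, W is w x D; returns q_w..q_T."""
    w, D = len(W), len(W[0])
    return [
        sum(W[i][d] * X[t - w + 1 + i][d] for i in range(w) for d in range(D)) + b
        for t in range(w - 1, len(X))   # t is the 0-based index of the window's end
    ]

def decay_weights(n_out: int, alpha: float) -> List[float]:
    """A_t = 1 / (1 + Delta_t)^alpha, where Delta_t is the distance to the last step."""
    return [1.0 / (1.0 + (n_out - 1 - j)) ** alpha for j in range(n_out)]

def temporal_features(X: List[List[float]], W: List[List[float]], b: float,
                      alpha: float, mean: float = 0.0, var: float = 1.0,
                      eps: float = 1e-5) -> List[float]:
    """Convolution -> fixed-statistics normalization -> ReLU -> time-decay weighting."""
    q = conv1d(X, W, b)
    h = [max(0.0, (v - mean) / math.sqrt(var + eps)) for v in q]
    return [a * v for a, v in zip(decay_weights(len(h), alpha), h)]

# Toy univariate series (D = 1) and a difference filter of width w = 2.
toy_X = [[1.0], [2.0], [4.0], [8.0]]
toy_W = [[-1.0], [1.0]]
weighted = temporal_features(toy_X, toy_W, b=0.0, alpha=1.0, eps=0.0)
```

With $\alpha=1$ the most recent pattern keeps its full value while the pattern two steps back is scaled by $1/3$; larger values of $\alpha$ concentrate the representation more strongly on recent months.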

Figure 4: Architecture of the temporal convolution module. Local temporal patterns are extracted by a 1-D convolution and then modulated by a monotonic time-decay weighting before projection into the final temporal embedding space.

5.4 Learning and Optimization

Our model learns a pairwise preference function over entity candidates. For each training instance $i$ consisting of a source entity $e_{s}$ and a positive-negative pair $(e_{(+)},e_{(-)})$, the model produces a score $f_{t(i)}(e_{s},e)$ for each candidate. The probability that $e_{(+)}$ should be ranked above $e_{(-)}$ is defined through a logistic link applied to the score difference:

$$\bar{y}_{i}=\sigma\big(f_{t(i)}(e_{s},e_{(+)})-f_{t(i)}(e_{s},e_{(-)})\big).$$

To obtain supervision, we construct soft preference targets from navigation statistics:

$$\tilde{P}_{i}=\frac{y^{t(i)}_{\{e_{s},e_{(+)}\}}}{y^{t(i)}_{\{e_{s},e_{(+)}\}}+y^{t(i)}_{\{e_{s},e_{(-)}\}}},$$

which provide a smoothed estimate of how often users prefer $e_{(+)}$ over $e_{(-)}$ when navigating from $e_{s}$ at time $t(i)$.

We train the network using the regularized cross-entropy objective:

$$L=-\frac{1}{N}\sum_{i=1}^{N}\Big(\tilde{P}_{i}\log\bar{y}_{i}+(1-\tilde{P}_{i})\log(1-\bar{y}_{i})\Big)+\lambda\lVert\theta\rVert_{2}^{2},$$

where $\theta$ denotes all model parameters. This loss is a smooth pairwise surrogate for ranking and provides calibrated probability estimates, enabling the model to distinguish fine differences in temporal relatedness. Parameters are optimized using Adam Kingma and Ba (2014).
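The objective above can be sketched as follows. The clamping of $\bar{y}_{i}$ away from 0 and 1 is an added numerical safeguard (an assumption of this sketch, not part of the stated loss), and the scores, click counts, and parameter list are toy values.

```python
import math
from typing import List, Sequence, Tuple

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def soft_target(y_pos: float, y_neg: float) -> float:
    """Soft preference target: share of navigations going to the positive entity."""
    return y_pos / (y_pos + y_neg)

def pairwise_loss(score_pairs: Sequence[Tuple[float, float]],
                  click_pairs: Sequence[Tuple[float, float]],
                  params: List[float], lam: float) -> float:
    """Cross-entropy between soft targets and sigmoid(score difference), plus L2."""
    total = 0.0
    for (s_pos, s_neg), (y_pos, y_neg) in zip(score_pairs, click_pairs):
        p = soft_target(y_pos, y_neg)
        y_bar = sigmoid(s_pos - s_neg)
        y_bar = min(max(y_bar, 1e-12), 1.0 - 1e-12)  # safeguard against log(0)
        total += p * math.log(y_bar) + (1.0 - p) * math.log(1.0 - y_bar)
    return -total / len(score_pairs) + lam * sum(w * w for w in params)

# Toy instance: 30 navigations to e(+) and 10 to e(-), so the target is 0.75.
clicks = [(30.0, 10.0)]
loss_equal_scores = pairwise_loss([(0.0, 0.0)], clicks, params=[], lam=0.0)
loss_matched = pairwise_loss([(math.log(3.0), 0.0)], clicks, params=[], lam=0.0)
```

For a single pair, the unregularized loss is minimized when the predicted preference $\bar{y}_{i}$ equals the soft target, here when the score difference is $\log 3$, which is why `loss_matched` is lower than `loss_equal_scores`.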

6 Experiments

6.1 Dataset

To recap from Section 4.1, we use the clickstream datasets from 2016. We also use the corresponding Wikipedia article dumps, with over 4 million entities represented by actual pages. Since a Wikipedia article is often long, in this work we make use of only its abstract section. To obtain the temporal signals of an entity, we use the page view statistics of Wikipedia articles and aggregate the counts by month. We fetch the data from June 2014 up until the studied time, which results in a length of 27 months.

Seed entities and related candidates. To extract popular and trending entities, we take from the clickstream data the top 10,000 entities based on the number of navigations from major search engines (Google and Bing) at the studied time. Obtaining a subset of related entity candidates for efficiency purposes has been well addressed in related work Guo and Barbosa (2014); Ponza et al. (2017). In this work, we do not employ such a method and simply assume that an appropriate one is used. In the experiments, we choose only candidates that are visited from the seed entities at the studied time. We filter out entity-candidate pairs with too few navigations (fewer than 10) and consider the top-100 candidates.

Table 2: Statistics of the dataset.

| Statistic | Count |
|---|---|
| Total seed entities | 10,000 |
| Total entities | 1,420,819 |
| Candidates per entity (avg.) | 142 |
| Training seed entities | 8,000 |
| Dev. seed entities | 1,000 |
| Test seed entities | 1,000 |
| Training pairs | 100,650K |
| Dev. pairs | 12,420K |
| Test pairs | 12,590K |

6.2 Models for Comparison

In this paper, we compare our models against the following baselines.

Wikipedia Link-based (WLM): Witten and Milne (2008) proposed a low-cost measure of semantic relatedness based on Wikipedia entity graph, inspired by Normalized Google Distance.

DeepWalk (DW): DeepWalk Perozzi et al. (2014) learns representations of vertices in a graph with a random walk generator and language modeling. We chose not to compare with the matrix factorization approach in Zhao et al. (2015): even though it allows the incorporation of different relation types (i.e., among entity, category and word), its iterative computation over large graphs is very expensive. When considering only entity-entity relations, its reported performance is rather similar to that of DW.

Entity2Vec Model (E2V): entity embedding learning using the Skip-Gram model Mikolov et al. (2013). E2V utilizes textual information to capture latent word relationships. Similar to Zhao et al. (2015); Ni et al. (2016), we use Wikipedia articles as the training corpus to learn word vectors while preserving the hyperlinks between entities.

ParaVecs (PV):  Le and Mikolov (2014); Dai et al. (2015) learned document/entity vectors via the distributed memory (ParaVecs-DM) and distributed bag of words (ParaVecs-DBOW) models, using hierarchical softmax. We use Wikipedia articles as training corpus to learn entity vectors.

RankSVM:  Ceccarelli et al. (2013) learned entity relatedness from a set of 28 handcrafted features, using the traditional learning-to-rank method, RankSVM. We put together additional well-known temporal features Kanhabua et al. (2014); Zhang et al. (2016b) (i.e., time series cross correlation, trending level and predicted popularity based on page views) and report the results of the extended feature set.

For our approach, we tested different combinations of content (denoted as $\mathbf{Content_{Emb}}$), graph ($\mathbf{Graph_{Emb}}$) and time (TS-CNN-Att) networks. We also test the content and graph networks with pretrained entity representations (i.e., ParaVecs-DM and DeepWalk).

6.3 Experimental Setup

Evaluation procedures. The time granularity is set to months. The studied time $t_{n}$ of our experiments is September 2016. From the seed queries, we use 80% for training, 10% for development and 10% for testing, as shown in Table 2. Note that, for the time-aware setting and to avoid leakage and bias as much as possible, the data for training and development (including supervision) are taken only up until time $t_{n}-1$. Specifically, for content and graph data, only $t_{n}-1$ is used.

Metrics. We use two correlation coefficients, Pearson and Spearman, which have often been used in the literature, cf. Dallmann et al. (2016); Ponza et al. (2017). The Pearson index focuses on the difference between predicted and correct relatedness scores, while Spearman focuses on the ranking order among entity pairs. Our work studies the strength of the dynamic relatedness between entities, hence we focus more on the Pearson index. However, traditional correlation metrics do not consider the positions in the ranked list (correlations at the top or bottom are treated equally). For this reason, we adjust the metric to consider the rankings at specific top-k positions, so that it measures the correlation for only the top items in the ranking (based on the ground truth). In addition, we use the Normalized Discounted Cumulative Gain (NDCG) measure to evaluate the recommendation tasks.
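One plausible reading of these metrics is sketched below. Two points are assumptions, since the text does not give the exact formulas: the top-k adjustment is implemented by restricting the Pearson correlation to the k items with the highest ground-truth values, and NDCG uses a linear gain with a $\log_{2}$ position discount.

```python
import math
from typing import List

def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def pearson_at_k(pred: List[float], truth: List[float], k: int) -> float:
    """Pearson restricted to the k items ranked highest by the ground truth."""
    top = sorted(range(len(truth)), key=lambda i: -truth[i])[:k]
    return pearson([pred[i] for i in top], [truth[i] for i in top])

def ndcg_at_k(ranked_rels: List[float], k: int) -> float:
    """NDCG@k for relevance grades listed in predicted rank order (linear gain)."""
    def dcg(rels: List[float]) -> float:
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

# Toy values: the prediction agrees with the truth on the top-3 items only.
toy_pred = [1.0, 2.0, 3.0, 10.0]
toy_truth = [10.0, 20.0, 30.0, 1.0]
corr_top3 = pearson_at_k(toy_pred, toy_truth, k=3)
corr_all = pearson(toy_pred, toy_truth)
```

In the toy example the correlation over all items is negative because of one badly scored tail item, while the top-3 correlation is perfect, which illustrates why a position-aware variant can tell a different story than the global coefficient.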

Implementation details. All neural models are implemented in TensorFlow. The initial learning rate is tuned amongst {1e-2, 1e-3, 1e-4, 1e-5}. The batch size is tuned amongst {50, 100, 200}. The weight matrices are initialized with samples from the uniform distribution of Glorot and Bengio (2010). Models are trained for a maximum of 25 epochs. The number of hidden layers for each network is chosen from {2, 3, 4}, and the number of hidden nodes from {128, 256, 512}. The dropout rate is chosen from {0.2, 0.3, 0.5}. The pretrained DW is empirically set to 128 dimensions, and PV to 200. For the CNN, the number of filters is in {10, 20, 30}, the window size in {4, 5, 6}, the number of convolutional layers in {1, 2, 3} and the decay rate $\alpha$ in $\{1.0, 1.5, \ldots, 7.5\}$. Two convolutional layers with window sizes 5 and 4 and 20 and 25 filters, respectively, are used for the decay hyperparameter analysis.

Table 3: Performance of different models on task (1), ranking correlation (Pearson and Spearman’s $\rho$), and task (2), recommendation (measured by nDCG). Bold and underlined numbers indicate best and second-best results. $\mp$ indicates statistical significance over WLM ($p<0.05$).

| Model | Pearson ×100 @10 | Pearson ×100 @30 | Pearson ×100 @50 | Pearson ×100 all | $\rho$ ×100 all | nDCG (proxy) @3 | nDCG (proxy) @10 | nDCG (proxy) @20 | nDCG (human) @3 | nDCG (human) @10 | nDCG (human) @20 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baselines | | | | | | | | | | | |
| WLM | 27.6 | 28.3 | 24.0 | 19.4 | 12.1 | 0.63 | 0.59 | 0.62 | 0.50 | 0.46 | 0.52 |
| RankSVM | 28.5 | 34.7 | 31.4 | 20.7 | 27.5 | 0.65 | 0.61 | 0.64 | 0.52 | 0.61 | 0.65 |
| Entity2Vec | 18.6 | 22.0 | 21.8 | 20.5 | 18.7 | 0.62 | 0.60 | 0.61 | 0.54 | 0.53 | 0.54 |
| DeepWalk | 31.3 | 30.9 | 21.4 | 17.6 | 10.1 | 0.41 | 0.43 | 0.47 | 0.34 | 0.38 | 0.45 |
| ParaVecs-DBOW | 18.6 | 22.0 | 21.8 | 20.5 | 16.0 | 0.62 | 0.60 | 0.61 | 0.50 | 0.50 | 0.55 |
| ParaVecs-DM | 19.0 | 23.0 | 23.2 | 22.3 | 18.3 | 0.66 | 0.63 | 0.63 | 0.49 | 0.52 | 0.58 |
| Model Ablation | | | | | | | | | | | |
| TS-CNN | 51.9 | 51.0 | 43.0 | 35.8 | 26.5 | 0.41 | 0.43 | 0.47 | 0.40 | 0.43 | 0.48 |
| TS-CNN-Att (Base) | 57.9 | 49.7 | 44.7 | 37.1 | 24.9 | 0.43 | 0.44 | 0.49 | 0.38 | 0.45 | 0.50 |
| Base+PV | 60.6 | 44.2 | 41.4 | 36.4 | 11.2 | 0.41 | 0.43 | 0.47 | 0.49 | 0.51 | 0.55 |
| Base+DW | 43.5 | 36.5 | 35.7 | 32.7 | 31.0 | 0.44 | 0.48 | 0.53 | 0.47 | 0.51 | 0.52 |
| Base+PV+DW | 56.9 | 46.1 | 43.4 | 32.9 | 28.4 | 0.41 | 0.44 | 0.48 | 0.49 | 0.54 | 0.57 |
| $Content_{Emb}$+$Graph_{Emb}$ | 48.9 | 40.1 | 49.9 | 37.5 | 27.9 | 0.67 | 0.62 | 0.70 | 0.61 | 0.69 | 0.65 |
| Base+$Content_{Emb}$ | 67.1 | 54.2 | 53.4 | 43.7 | 26.5 | 0.67 | 0.69 | 0.71 | 0.61 | 0.72 | 0.74 |
| Base+$Graph_{Emb}$ | 55.2 | 50.2 | 41.3 | 31.5 | 35.5 | 0.71 | 0.75 | 0.78 | 0.65 | 0.78 | 0.81 |
| Trio | 58.6 | 54.3 | 50.2 | 45.4 | 43.5 | 0.75 | 0.78 | 0.83 | 0.74 | 0.82 | 0.85 |

6.4 Experimental Tasks

We evaluate our proposed method in two different scenarios: (1) relatedness ranking and (2) entity recommendation. The first task evaluates how well we can mimic the ranking induced by entity navigation. Here we use the raw number of navigations in the Wikipedia clickstream. The second task is formulated as: given an entity, suggest the top-k entities most related to it right now. Since there is no standard ground truth for this temporal task, we constructed two relevance ground truths. The first one is the proxy ground truth, where the relevance grade is automatically assigned from the (top-100) most navigated target entities. The graded relevance score is then given as the reversed rank order. For this, all entities in the test set are used. The second one is based on human judgments with a 5-level graded relevance scale, i.e., from 4 (highly relevant) to 0 (not temporally relevant). Two human experts evaluate a subset of 20 entities (randomly sampled from the test set), with 600 entity pairs (approx. 30 per seed, using the pooling method). The ground-truth size is comparable to that of the widely used ground truth for static relatedness assessment, KORE Hoffart et al. (2012). The Cohen’s Kappa agreement is 0.72. The performance of the best-performing models on this dataset is then tested with a paired t-test against the WLM baseline.
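The proxy grading and the inter-annotator agreement described above can be illustrated with a short sketch. The function names and the toy inputs are hypothetical; the grading assumes that the most navigated target receives the highest grade and that targets outside the top-n receive no grade, and the agreement is computed as unweighted Cohen's kappa, which is an assumption since the text does not say whether a weighted variant was used.

```python
from collections import Counter


def proxy_grades(click_counts, top_n=100):
    """Map each of the top_n most navigated targets to a grade equal to
    its reversed rank (most navigated target gets the highest grade)."""
    ranked = sorted(click_counts, key=click_counts.get, reverse=True)[:top_n]
    n = len(ranked)
    return {entity: n - pos for pos, entity in enumerate(ranked)}


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators over the same items.
    Undefined (division by zero) if chance agreement equals 1."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy navigation counts for one seed entity (made-up numbers).
    clicks = {"A": 50, "B": 10, "C": 30}
    print(proxy_grades(clicks))
    # Toy 5-level judgments from two annotators (made-up numbers).
    print(cohen_kappa([4, 3, 0, 2, 1], [4, 3, 0, 1, 1]))
```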

6.5 Results on Relatedness Ranking

We report the performance of the relatedness ranking on the left side of Table 3, with the Pearson and Spearman metrics. Among existing baselines, we observe that link-based approaches, i.e., WLM and DeepWalk, perform better than others for top-k correlation, whereas temporal models yield substantial improvement overall. Specifically, TS-CNN-Att performs better than the no-attention model in most cases, improving by 11% for Pearson@10 and by 3% when considering the total rank. Our trio model performs well overall and gives the best results for the total rank. The duo models (which combine the base with either pretrained DW or PV) also deliver improvements over the sole temporal ones. We also observe additional gains when combining the temporal base with pretrained DW and PV altogether.
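The percentages quoted above appear to be relative gains of TS-CNN-Att over TS-CNN computed from the corresponding Table 3 entries; this reading is an assumption, checked with the small calculation below (the helper name is hypothetical).

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Table 3 values for TS-CNN (no attention) and TS-CNN-Att.
pearson_at_10 = relative_gain(51.9, 57.9)  # roughly 11.6
pearson_all = relative_gain(35.8, 37.1)    # roughly 3.6

if __name__ == "__main__":
    print(round(pearson_at_10, 1), round(pearson_all, 1))
```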

Figure 5: Performance results for variation of decay parameter and different entity types. (a) Decay parameter for time-series embedding. (b) Model performances for person-type entities. (c) Model performances for social event-type entities.

Figure 6: Convergence of decay parameters.

6.6 Results on Entity Recommendation

Here we report the results on the nDCG metrics. Table 3 (right side) shows the results for the two ground-truth settings (proxy and human). We observe that the baselines perform well on this task compared with the conventional temporal models, most notably in the proxy setting. This can be explained by the fact that ‘static’ entity relations are ranked high by the non-time-aware baselines, and hence are still rewarded under a fine-grained grading scale (100 levels). The margin becomes smaller in the human setting, with the standard 5-level scale. All the models with pretrained representations perform poorly. This shows that for this task, an early interaction-based approach is more suitable than one based purely on representations.

Table 4: Different top-k rankings for entity Kingsman: The Golden Circle. Italic means irrelevance.

| PV-DM | TS-CNN-Att | Temp+PV | Trio |
|---|---|---|---|
| Secret Service | Halle Berry | Elton John | Mark Strong |
| Spider-Man | X-Men | Taron Egerton | Jeff Bridges |
| Taron Egerton | Jeff Bridges | Edward Holcroft | Julianne Moore |

6.7 Additional Analysis

We present an anecdotal example of top-selected entities for Kingsman: The Golden Circle in Table 4. While the content-based model favors old relations such as the preceding movies, TS-CNN puts the popular actress Halle Berry or the recently released X-Men: Apocalypse on top. The latter is not ideal, as there is no solid relationship between the two movies. One implication is that the two entities are ranked high more because of their own popularity than because of the strength of their relationship to the source entity. The Trio model addresses the issue by taking other perspectives into account; it also balances the recency and long-term factors, and gives the best ranking performance.

Analysis on decay hyper-parameter. We study the effect of the decay parameter on performance. Figure 5(a) illustrates the results on $Pearson_{all}$ and nDCG@10 for the trio model. It can be seen that while nDCG slightly increases, the Pearson score peaks when $\alpha$ is in the range $[1.5, 3.5]$. Additionally, we show the convergence analysis on $\alpha$ for TS-CNN-Att in Figure 6. A larger $\alpha$ tends to converge faster, but to a significantly higher loss when $\alpha$ is over 5.5 (omitted from the figure).

Performances on different entity types. We show in Figures 5(b) and 5(c) the model performances on the person and event types. WLM performs worse on the latter, which can be interpreted as link-based methods tending to adapt slowly to recently trending entities. The temporal models seem to capture these entities better.

7 Conclusion

In this work, we presented a trio neural model to solve the dynamic entity relatedness ranking problem. The model jointly learns rich representations of entities from textual content, graph and temporal signals. We also propose an effective CNN-based attentional mechanism for learning the temporal representation of an entity. Experiments on ranking correlations and top-$k$ recommendation tasks demonstrate the effectiveness of our approach over existing baselines. For future work, we aim to incorporate more temporal signals, and to investigate different ‘trainable’ attention mechanisms that go beyond the time-based decay, for instance by incorporating latent topics.

Acknowledgments. This work is funded by the ERC Advanced Grant ALEXANDRIA (grant no. 339233). We thank the reviewers for the suggestions on the content and structure of the paper.

References

  • Aggarwal and Buitelaar (2014) Nitish Aggarwal and Paul Buitelaar. 2014. Wikipedia-based distributional semantics for entity relatedness. In 2014 AAAI Fall Symposium Series.
  • Blanco et al. (2013) Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, and Nicolas Torzec. 2013. Entity recommendations in web search. In ISWC, pages 33–48. Springer.
  • Ceccarelli et al. (2013) Diego Ceccarelli, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Salvatore Trani. 2013. Learning relatedness measures for entity linking. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pages 139–148. ACM.
  • Cheng et al. (2012) Weiwei Cheng, Eyke Hüllermeier, Willem Waegeman, and Volkmar Welker. 2012. Label ranking with partial abstention based on thresholded probabilistic models. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2501–2509. Curran Associates, Inc.
  • Chopra et al. (2005) Sumit Chopra, Raia Hadsell, and Yann LeCun. 2005. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 539–546. IEEE.
  • Dai et al. (2015) Andrew M Dai, Christopher Olah, and Quoc V Le. 2015. Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998.
  • Dallmann et al. (2016) Alexander Dallmann, Thomas Niebler, Florian Lemmerich, and Andreas Hotho. 2016. Extracting semantics from random walks on wikipedia: Comparing learning and counting methods.
  • Gabrilovich and Markovitch (2007) Evgeniy Gabrilovich and Shaul Markovitch. 2007. Computing semantic relatedness using wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI’07, pages 1606–1611, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  • Gabrilovich and Markovitch (2009) Evgeniy Gabrilovich and Shaul Markovitch. 2009. Wikipedia-based semantic interpretation for natural language processing. Journal of Artificial Intelligence Research, 34:443–498.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256.
  • Guo et al. (2016) Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 55–64. ACM.
  • Guo and Barbosa (2014) Zhaochen Guo and Denilson Barbosa. 2014. Robust entity linking via random walks. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 499–508. ACM.
  • Hoffart et al. (2012) Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. 2012. Kore: keyphrase overlap relatedness for entity disambiguation. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 545–554. ACM.
  • Hu et al. (2015) Zhiting Hu, Poyao Huang, Yuntian Deng, Yingkai Gao, and Eric Xing. 2015. Entity hierarchy embedding. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1292–1300.
  • Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 2333–2338. ACM.
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456.
  • Jiang et al. (2016) Tingsong Jiang, Tianyu Liu, Tao Ge, Lei Sha, Baobao Chang, Sujian Li, and Zhifang Sui. 2016. Towards time-aware knowledge graph completion. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1715–1724.
  • Kang et al. (2015) Changsung Kang, Dawei Yin, Ruiqiang Zhang, Nicolas Torzec, Jianzhang He, and Yi Chang. 2015. Learning to rank related entities in web search. Neurocomputing, 166:309–318.
  • Kanhabua et al. (2014) Nattiya Kanhabua, Tu Ngoc Nguyen, and Claudia Niederée. 2014. What triggers human remembering of events? a large-scale analysis of catalysts for collective memory in wikipedia. In Digital Libraries (JCDL), 2014 IEEE/ACM Joint Conference on, pages 341–350. IEEE.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196.
  • Lin et al. (2017) Tao Lin, Tian Guo, and Karl Aberer. 2017. Hybrid neural networks for learning the trend in time series.
  • Lu and Li (2013) Zhengdong Lu and Hang Li. 2013. A deep architecture for matching short texts. In Advances in Neural Information Processing Systems, pages 1367–1375.
  • Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Miliaraki et al. (2015) Iris Miliaraki, Roi Blanco, and Mounia Lalmas. 2015. From selena gomez to marlon brando: Understanding explorative entity search. In Proceedings of the 24th International Conference on World Wide Web, pages 765–775. International World Wide Web Conferences Steering Committee.
  • Moro et al. (2014) Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics, 2:231–244.
  • Nguyen et al. (2018) Tu Ngoc Nguyen, Nattiya Kanhabua, and Wolfgang Nejdl. 2018. Multiple models for recommending temporal aspects of entities. In The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018, Proceedings, pages 462–480.
  • Ni et al. (2016) Yuan Ni, Qiong Kai Xu, Feng Cao, Yosi Mass, Dafna Sheinwald, Hui Jia Zhu, and Shao Sheng Cao. 2016. Semantic documents relatedness using concept graph representation. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM ’16, pages 635–644, New York, NY, USA. ACM.
  • Ordóñez and Roggen (2016) Francisco Javier Ordóñez and Daniel Roggen. 2016. Deep convolutional and lstm recurrent neural networks for multimodal wearable activity recognition. Sensors, 16(1):115.
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM.
  • Ponza et al. (2017) Marco Ponza, Paolo Ferragina, and Soumen Chakrabarti. 2017. A two-stage framework for computing entity relatedness in wikipedia. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, pages 1867–1876, New York, NY, USA. ACM.
  • Tran et al. (2017) Nam Khanh Tran, Tuan Tran, and Claudia Niederée. 2017. Beyond time: Dynamic context-aware entity recommendation. In European Semantic Web Conference, pages 353–368. Springer.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Witten and Milne (2008) Ian H Witten and David N Milne. 2008. An effective, low-cost measure of semantic relatedness obtained from wikipedia links.
  • Yin et al. (2016) Wenpeng Yin, Hinrich Schütze, Bing Xiang, and Bowen Zhou. 2016. Abcnn: Attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association of Computational Linguistics, 4(1):259–272.
  • Yu et al. (2014) Xiao Yu, Hao Ma, Bo-June Paul Hsu, and Jiawei Han. 2014. On building entity recommender systems using user click log and freebase knowledge. In Proceedings of WSDM, pages 263–272. ACM.
  • Zhang et al. (2016a) Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016a. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 353–362. ACM.
  • Zhang et al. (2016b) Lei Zhang, Achim Rettinger, and Ji Zhang. 2016b. A probabilistic model for time-aware entity recommendation. In International Semantic Web Conference, pages 598–614. Springer.
  • Zhao et al. (2015) Yu Zhao, Zhiyuan Liu, and Maosong Sun. 2015. Representation learning for measuring entity relatedness with rich information. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
  • Zheng et al. (2014) Yi Zheng, Qi Liu, Enhong Chen, Yong Ge, and J Leon Zhao. 2014. Time series classification using multi-channels deep convolutional neural networks. In International Conference on Web-Age Information Management, pages 298–310. Springer.