Nordlys 0.2 documentation¶
Nordlys is a toolkit for entity-oriented and semantic search, created by the IAI group at the University of Stavanger.
Entities (such as people, organizations, or products) are meaningful units for organizing information and can provide direct answers to many search queries.
Nordlys supports 3 functionalities in the context of entity-oriented search:
- Entity retrieval: Returns a ranked list of entities in response to a query
- Entity linking: Identifies entities in a query and links them to the corresponding entry in the Knowledge base
- Target type identification: Detects the target types (or categories) of a query
Check our Web interface documentation for an illustration of each of these functionalities.
Nordlys can be used …¶
- through a web-based GUI
- through a RESTful API
- as a command line tool
- as a Python package
Contents¶
Installation¶
Nordlys is a general-purpose semantic search toolkit, which can be used either as a Python package, as a command line tool, or as a service.
Data are a first-class citizen in Nordlys. To make use of the full functionality, the required data backends (MongoDB and Elasticsearch) need to be set up and the data collections need to be loaded into them. There is built-in support for specific data collections, including DBpedia and Freebase. You may use the data dumps we prepared, or download, process, and index these datasets from the raw sources.
1 Installing the Nordlys package¶
This step is required for all usages of Nordlys (i.e., either as a Python package, command line tool or service).
1.1 Environment¶
Nordlys requires Python 3.5+ and a Python environment you can install packages in. We highly recommend using an Anaconda Python distribution.
1.2 Obtain source code¶
You can clone the Nordlys repo using the following:
$ git clone https://github.com/iai-group/nordlys.git
1.3 Install prerequisites¶
Install Nordlys prerequisites using pip:
$ pip install -r requirements.txt
If you don’t have pip yet, install it using
$ easy_install pip
Note
On Ubuntu, you might need to install lxml using a package manager
$ apt-get install python-lxml
1.4 Test installation¶
You may check whether your installation was successful by running any of the command line tools; e.g., from the root of your Nordlys folder, issue
$ python -m nordlys.core.retrieval.retrieval
Alternatively, you can try importing Nordlys into a Python project. Make sure your local Nordlys copy is on the PYTHONPATH. Then you may try, e.g., from nordlys.core.retrieval.scorer import Scorer.
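For instance, a minimal check script along these lines should run without errors (this is only an illustrative sketch; it assumes the repository root is your working directory or is on the PYTHONPATH):

# check_install.py -- minimal sanity check that the nordlys package
# and its Python dependencies can be imported
from nordlys.core.retrieval.scorer import Scorer

print("Nordlys import OK:", Scorer.__name__)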
Mind that this step only checks the Python dependencies. There is rather little you can do with Nordlys without setting up the data backend and loading the data components.
2 Setting up data backend¶
We use MongoDB and Elasticsearch for storing and indexing data. You can either connect to these services already running on some server or set these up on your local machine.
2.1 MongoDB¶
If you need to install MongoDB yourself, follow the instructions here.
Adjust the settings in config/mongo.json, if needed.
If you’re using macOS, you’ll likely need to raise the soft limit of maxfiles to at least 64000 for MongoDB to work properly. Check the maxfiles limit of your system using:
$ launchctl limit
2.2 Elasticsearch¶
If you need to install Elasticsearch yourself, follow the instructions here. Note that Elasticsearch requires Java version 8.
Adjust the settings in config/elastic.json, if needed.
Nordlys has been tested with Elasticsearch version 5.x (and would likely need updating for newer versions).
3 Loading data components¶
Data are a crucial component of Nordlys. While most of the functionality is agnostic of the underlying knowledge base, there is built-in support for working with specific data sources. This primarily means DBpedia, with associated resources from Freebase.
Specifically, Nordlys is shipped with functionality designed around DBpedia 2015-10, and the dumps we provide are for that particular version. However, we provide config files and instructions for working with DBpedia 2016-10, should you prefer a newer version.
Note that you may need only a certain subset of the data, depending on the required functionality. See Data components for a detailed description.
The figure below shows an overview of data sources and their dependencies.
3.1 Load data to MongoDB¶
You can either load the data to MongoDB (i) from dumps that we made available or (ii) from the raw source files (DBpedia, FACC, Word2vec, etc.). Below, we discuss the former option. For the latter, see Building MongoDB sources from raw data. Note that processing from the raw sources takes significantly longer because of the nontrivial amount of data.
To load the data to MongoDB, you need to run the following commands from the main Nordlys folder. Note that the first dump is required for the core Nordlys functionality over DBpedia. The other dumps are optional, depending on whether the respective functionality is needed.
| Command | Required for |
|---|---|
| ./scripts/load_mongo_dumps.sh mongo_dbpedia-2015-10.tar.bz2 | All |
| | EL and EC |
| ./scripts/load_mongo_dumps.sh mongo_word2vec-googlenews.tar.bz2 | TTI |
3.2 Download auxiliary data files¶
The following files are needed for various services. You may download them all using
$ ./scripts/download_auxiliary.sh
| Description | Location (relative to main Nordlys folder) | Required for |
|---|---|---|
| Type-to-entity mapping | data/raw-data/dbpedia-2015-10/type2entity-mapping | TTI |
| Freebase-to-DBpedia mapping | data/raw-data/dbpedia-2015-10/freebase2dbpedia | EL |
| Entity snapshot | data/el | EL 1 |
- 1 If entity annotations are to be limited to a specific set; this file contains the proper named entities in DBpedia 2015-10
3.3 Build Elastic indices¶
There are multiple Elasticsearch indices, created to support the different services. Run the following commands from the main Nordlys folder to build the indices for the respective functionality.
| Command | Source | Required for |
|---|---|---|
| ./scripts/build_dbpedia_index.sh core | MongoDB | ER, EL, TTI |
| ./scripts/build_dbpedia_index.sh types | Raw DBpedia files 1 | TTI |
| ./scripts/build_dbpedia_index.sh uri | MongoDB | ER 2 |
- 1 Requires the short entity abstracts and instance types files
- 2 Only for the ELR model
Note
To use the 2016-10 version of DBpedia, add 2016-10 as a second argument to the above scripts.
Data components¶
Data sources¶
DBpedia¶
We use DBpedia as the main underlying knowledge base. In particular, we prepared dumps for DBpedia version 2015-10.
DBpedia is distributed, among other formats, as a set of .ttl.bz2 files. We use a selection of these files, as defined in data/config/dbpedia2mongo.config.json. You can download these files directly from the DBpedia website or by running ./scripts/download_dbpedia.sh from the main Nordlys folder. The script places the downloaded files under data/raw-data/dbpedia-2015-10/.
We also provide a minimal sample from DBpedia under data/dbpedia-2015-10-sample, which can be used for testing/development in a local environment.
FACC¶
The Freebase Annotations of the ClueWeb Corpora (FACC) is used for building the entity surface form dictionary. You can download the collection from its main website and further process it using our scripts. Alternatively, you can download the preprocessed data from our server. Check the README file under data/raw-data/facc for detailed information.
MongoDB collections¶
The table below provides an overview of the MongoDB collections that are used by the different services.
| Name | Description | EC | ER | EL | TTI |
|---|---|---|---|---|---|
| dbpedia-2015-10 | DBpedia | +1 | +2 | | +3 |
| fb2dbp-2015-10 | Mapping from Freebase to DBpedia IDs | +4 | | + | |
| surface_forms_dbpedia | Entity surface forms from DBpedia | +5 | | +6 | |
| surface_forms_facc | Entity surface forms from FACC | +7 | | + | |
| word2vec-googlenews | Word2vec trained on Google News | | | | +8 |
- 1 for entity ID-based lookup and DBpedia2Freebase mapping functionalities
- 2 only for building the Elastic entity index; not used later in the retrieval process
- 3 for entity-centric TTI method
- 4 for Freebase2DBpedia mapping functionality
- 5 for entity surface form lookup from DBpedia
- 6 for all EL methods other than “commonness”
- 7 for entity surface form lookup from FACC
- 8 for LTR TTI method
Building MongoDB sources from raw data¶
To build the above tables from raw data (as opposed to the provided dumps), first make sure that you have the raw data files.
- For DBpedia, these may be downloaded using ./scripts/download_dbpedia.sh
- For the FACC and Word2vec data files, execute ./scripts/download_raw.sh
To load DBpedia to MongoDB, run
python -m nordlys.core.data.dbpedia.dbpedia2mongo data/config/dbpedia-2015-10/dbpedia2mongo.config.json
Note
To use the DBpedia 2015-10 sample shipped with Nordlys, as opposed to the full collection, change the value of path to data/raw-data/dbpedia-2015-10_sample/ in dbpedia2mongo.config.json.
Elastic indices¶
| Name | Description | ER | EL | TTI |
|---|---|---|---|---|
| dbpedia_2015_10 | DBpedia index | + | +1 | +2 |
| dbpedia_2015_10_uri | DBpedia URI-only index | +3 | | |
| dbpedia_2015_10_types | DBpedia types index | | | +4 |
- 1 for all EL methods other than “commonness”
- 2 only for entity-centric TTI method
- 3 only for ELR entity ranking method
- 4 only for type-centric TTI method
RESTful API¶
The following Nordlys services can be accessed through a RESTful API: entity retrieval, entity linking in queries, target type identification, and the entity catalog.
Below we describe the usage for each of the services.
The Nordlys endpoint URL is http://api.nordlys.cc/ .
Entity Retrieval¶
The service presents a ranked list of entities in response to an entity-bearing query.
Endpoint URI¶
http://api.nordlys.cc/er
Example¶
Request:
http://api.nordlys.cc/er?q=total+recall
Response:
{
  "query": "total recall",
  "total_hits": 1000,
  "results": {
    "0": {
      "entity": "<dbpedia:Total_Recall_(1990_film)>",
      "score": -10.042525028471253
    },
    "1": {
      "entity": "<dbpedia:Total_Recall_(2012_film)>",
      "score": -10.295316626850521
    },
    ...
  }
}
Parameters¶
The following table lists the parameters needed in the request URL for entity retrieval.
Parameters | |
---|---|
q (required) | Search query |
1st_num_docs | The number of documents that will be re-ranked using a model. The recommended value (esp. for baseline comparisons) is 1000. Lower values, like 100, are recommended only when efficiency matters (default: 1000). |
start | Starting offset for ranked documents (default:0). |
fields_return | Comma-separated list of fields to return for each hit (default: “”). |
model | Name of the retrieval model; accepted values: “bm25”, “lm”, “mlm”, “prms” (default: “lm”). |
field | The name of the field used for LM (default: “catchall”). |
fields | Comma-separated list of the fields for PRMS (default: “catchall”). |
field_weights | Comma-separated list of fields and their corresponding weights for MLM (default: “catchall:1”). |
smoothing_method | Smoothing method for LM-based models; Accepted values: jm, dirichlet (default: “dirichlet”). |
smoothing_param | The value of smoothing parameters (lambda or mu); Accepted values: float, “avg_len” (default for “jm”: 0.1, default for “dirichlet”: 2000). |
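The API can also be called programmatically. The sketch below issues the request above with the Python requests library; the endpoint and parameter names come from the table, while the client code itself is only an illustration.

import requests

# Query the entity retrieval endpoint; parameters as documented above.
params = {
    "q": "total recall",          # required search query
    "model": "lm",                # bm25, lm, mlm, or prms
    "1st_num_docs": 1000,         # documents re-ranked in the second pass
    "fields_return": "abstract",  # extra fields to include for each hit
}
resp = requests.get("http://api.nordlys.cc/er", params=params)
resp.raise_for_status()

results = resp.json()["results"]
for rank in sorted(results, key=int):
    print(rank, results[rank]["entity"], results[rank]["score"])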
Entity Linking in Queries¶
The service identifies entities in queries and links them to the corresponding entry in the Knowledge base (DBpedia).
Endpoint URI¶
http://api.nordlys.cc/el
Example¶
Request:
http://api.nordlys.cc/el?q=total+recall
Response:
{
  "processed_query": "total recall",
  "query": "total recall",
  "results": [
    {
      "entity": "<dbpedia:Total_Recall_(1990_film)>",
      "mention": "total recall",
      "score": 0.4013333333333334
    },
    {
      "entity": "<dbpedia:Total_Recall_(2012_film)>",
      "mention": "total recall",
      "score": 0.315
    }
  ]
}
Parameters¶
The following table lists the parameters needed in the request URL for entity linking.
Parameters | |
---|---|
q (required) | The search query |
method | The name of the method; accepted values: “cmns”, “ltr” (default: “cmns”). |
threshold | The entity linking threshold (default: 0.1). |
Target Type Identification¶
The service assigns target types (or categories) to queries from the DBpedia type taxonomy.
Endpoint URI¶
http://api.nordlys.cc/tti
Example¶
Request:
http://api.nordlys.cc/tti?q=obama
Response:
{
  "query": "obama",
  "results": {
    "0": {
      "score": 3.3290777,
      "type": "<dbo:Ambassador>"
    },
    "1": {
      "score": 3.2955842,
      "type": "<dbo:Election>"
    },
    ...
  }
}
Parameters¶
The following table lists the parameters needed in the request URL for target type identification.
Parameters | |
---|---|
q (required) | The search query |
method | The name of method; accepted values: “tc”, “ec”, “ltr” (default: “tc”). |
num_docs | The number of top ranked target types to retrieve (default: 10). |
start | The starting offset for ranked types. |
model | Retrieval model, if method is “tc” or “ec”; Accepted values: “lm”, “bm25”. |
ec_cutoff | If method is “ec”, rank cut-off of top-K entities for EC TTI. |
field | Field name, if method is “tc” or “ec”. |
smoothing_method | If model is “lm”, smoothing method; accepted values: “jm”, “dirichlet”. |
smoothing_param | If model is “lm”, smoothing parameter; accepted values: float, “avg_len”. |
Entity Catalog¶
This service is used for representing entities (with IDs, name variants, attributes, and relationships). Additionally, it provides statistics that can be utilized, among others, for result presentation (e.g., identifying prominent properties when generating entity cards).
Endpoint URI¶
http://api.nordlys.cc/ec
Look up entity by ID¶
Request:
http://api.nordlys.cc/ec/lookup_id/<entity_id>
Example:
http://api.nordlys.cc/ec/lookup_id/<dbpedia:Albert_Einstein>
Response:
{ "<dbo:abstract>": ["Albert Einstein was a German-born theoretical physicist ... ], "<dbo:academicAdvisor>": ["<dbpedia:Heinrich_Friedrich_Weber>"], "<dbo:almaMater>": [ "<dbpedia:ETH_Zurich>", "<dbpedia:University_of_Zurich>" ], "<dbo:award>": [ "<dbpedia:Nobel_Prize_in_Physics>", "<dbpedia:Max_Planck_Medal>", ... ], "<dbo:birthDate>": ["1879-03-14"], ... }
Look up entity by name (DBpedia)¶
Looks up an entity by its surface form in DBpedia.
Request:
http://api.nordlys.cc/ec/lookup_sf/dbpedia/<sf>
Example:
http://api.nordlys.cc/ec/lookup_sf/dbpedia/new%20york
Response:
{ "_id": "new york" "<rdfs:label>" : { "<dbpedia:New_York>": 1 } "<dbo:wikiPageDisambiguates>": { "<dbpedia:Manhattan>": 1, "<dbpedia:New_York,_Kentucky>": 1, ... } ... }
Look up entity by name (FACC)¶
Looks up an entity by its surface form in FACC.
Request:
http://api.nordlys.cc/ec/lookup_sf/facc/<sf>
Example:
http://api.nordlys.cc/ec/lookup_sf/facc/new%20york
Response:
{ "_id" : "new york", "facc12" : { "<fb:m.02_286>": 18706787, "<fb:m.02_53fb>": 49, "<fb:m.02_b9l>": 87, "<fb:m.02_l43>": 12, "<fb:m.02_l5n>": 23963, ... } }
Map Freebase entity ID to DBpedia ID¶
Request:
http://api.nordlys.cc/ec/freebase2dbpedia/<fb_id>
Example:
http://api.nordlys.cc/ec/freebase2dbpedia/<fb:m.02_286>
Response:
{ "dbpedia_ids": [ "<dbpedia:New_York_City>" ] }
Map DBpedia entity ID to Freebase ID¶
Request:
http://api.nordlys.cc/ec/dbpedia2freebase/<dbp_id>
Example:
http://api.nordlys.cc/ec/dbpedia2freebase/<dbpedia:New_York>
Response:
{ "freebase_ids": [ "<fb:m.059rby>" ] }
Parameters¶
The following table lists the parameters needed in the request URL for entity catalog.
Parameter | |
---|---|
entity id | It is in the form of “<dbpedia:XXX>”, where XXX denotes the DBpedia/Wikipedia ID of an entity. |
sf | Entity surface form (e.g., “john smith”, “new york”). It needs to be url-escaped. |
fb_id | Freebase ID |
dbp_id | DBpedia ID |
References¶
[1] Jay M. Ponte and W. Bruce Croft. 1998. A language modeling approach to information retrieval. In Proc. of SIGIR ’98. 275–281.
[2] Paul Ogilvie and Jamie Callan. 2003. Combining document representations for known-item search. In Proc. of SIGIR ’03. 143–150.
[3] Jinyoung Kim, Xiaobing Xue, and W. Bruce Croft. 2009. A probabilistic retrieval model for semistructured data. In Proc. of ECIR ’09. 228–239.
[4] Faegheh Hasibi, Krisztian Balog, and Svein Erik Bratsberg. 2016. Exploiting entity linking in queries for entity retrieval. In Proc. of ICTIR ’16. 171–180. [BIB] [PDF]
[5] Faegheh Hasibi, Krisztian Balog, and Svein Erik Bratsberg. 2015. Entity linking in queries: Tasks and evaluation. In Proc. of ICTIR ’15. 171–180. [BIB] [PDF]
[6] Krisztian Balog and Robert Neumayer. 2012. Hierarchical target type identification for entity-oriented queries. In Proc. of CIKM ’12. 2391–2394. [BIB] [PDF]
[7] Shuo Zhang and Krisztian Balog. 2017. Design patterns for fusion-based object retrieval. In Proc. of ECIR ’17. 684–690. [BIB] [PDF]
[8] Darío Garigliotti, Faegheh Hasibi, and Krisztian Balog. 2017. Target type identification for entity-bearing queries. In Proc. of SIGIR ’17. [BIB] [PDF]
[9] Faegheh Hasibi, Krisztian Balog, and Svein Erik Bratsberg. 2017. Entity linking in queries: Efficiency vs. effectiveness. In Proc. of ECIR ’17. 40–53. [BIB] [PDF]
Web-based GUI¶
Nordlys is shipped with a web-based graphical user interface, which is built on the Nordlys API. It is a wrapper for all functionalities provided by Nordlys toolkit and can be used, e.g., to perform user studies on result presentation.
The implementation of the Web interface is based on Flask and Bootstrap. Below we describe the provided functionalities, illustrated with excerpts from the Web GUI.
Note
The interface can be accessed via: http://gui.nordlys.cc/
Entity search¶
The entity search tab provides a ranked list of entities in response to an entity-bearing query. We generate the results by calling the Entity retrieval service of our API; e.g.,
http://api.nordlys.cc/er?q=total+recall&model=lm&1st_num_docs=100&fields_return=abstract
By clicking on each result, we present an entity card, containing the factual summary of the entity and an image. We use the entity catalog (EC) service to generate these cards.

Nordlys Web interface - Entity Search
Entity linking in queries¶
For entity linking in queries, we use the baseline CMNS method, with threshold 0.1. For example, we make the following call to the entity linking service of our API:
http://api.nordlys.cc/el?q=arnold+schwarzenegger+total+recall&method=cmns&threshold=0.1

Nordlys Web interface - Entity Linking
Target Type Identification¶
For target type identification, we employ the type centric method; e.g.,
http://api.nordlys.cc/tti?q=obama

Nordlys Web interface - Target Type Identification
Note
For detailed information about our API calls, see the RESTful API section.
Command line usage¶
Nordlys provides the following command line applications:
- Core applications
- Retrieval general usage and configurations
- Machine learning general usage and configurations
- Entity-oriented applications
Nordlys architecture¶
Nordlys is based on a multitier architecture with three layers:

Nordlys architecture.
Core tier¶
The core tier provides basic functionalities and is connected to various third-party tools. The functionalities include:
- Retrieval (based on Elasticsearch)
- Storage (based on MongoDB)
- Machine learning (based on scikit-learn)
- Evaluation (based on trec-eval)
Additionally, a separate data package is provided with functionality for loading and preprocessing standard data sets (DBpedia, Freebase, ClueWeb, etc.).
It is possible to connect additional external tools (or replace our default choices) by implementing standard interfaces of the respective core modules.
Note
The core layer represents a versatile general-purpose modern IR library, which may also be accessed using command line tools.
Logic tier¶
The logic tier contains the main business logic, which is organized around five main modules:
- Entity provides access to the entity catalog (including knowledge bases and entity surface form dictionaries).
- Query provides the representation of search queries along with various preprocessing methods.
- Features is a collection of entity-related features, which may be used across different search tasks.
- Entity retrieval contains various entity ranking methods.
- Entity linking implements entity linking functionality.
The logic layer may not be accessed directly (i.e., as a service or as a command line application).
Services tier¶
The services tier provides end-user access to the toolkit’s functionality, through the command line, the API, and the web interface. Four main types of services are available: entity retrieval, entity linking, target type identification, and the entity catalog.
nordlys package¶
The Nordlys toolkit is provided as the nordlys Python package.
Nordlys is delivered with an extensive and detailed documentation, from package-level overviews, to module usage examples, to fully documented classes and methods.
Subpackages¶
nordlys.core package¶
Core packages¶
These low-level packages (basic IR, ML, NLP, storage, etc.) are basic building blocks that do not have any interdependencies among each other.
Subpackages¶
nordlys.core.data package¶
Builds a DBpedia type index from entity abstracts.
The index is built directly from DBpedia files in .ttl.bz2 format (i.e., MongoDB is not needed).
python -m nordlys.core.data.dbpedia.indexer_dbpedia_types -c <config_file>
- index_name: name of the index
- dbpedia_files_path: path to DBpedia .ttl.bz2 files
Authors: | Krisztian Balog, Dario Garigliotti |
---|
-
class
nordlys.core.data.dbpedia.indexer_dbpedia_types.
IndexerDBpediaTypes
(config)[source]¶ Bases:
object
-
build_index
(force=False)[source]¶ Builds the index.
Note: since DBpedia only has a few hundred types, no bulk indexing is needed.
Parameters: force – True iff it is required to overwrite the index (i.e. by creating it by force); False by default. :type force: bool :return:
-
name
¶
-
Adds entity surface forms from the Freebase Annotated ClueWeb Corpora (FACC).
The input to this script is (name variant, Freebase entity, count) triples. See data/facc1/README.md for the preparation of FACC data in such format.
Authors: | Krisztian Balog, Faegheh Hasibi |
---|
nordlys.core.eval package¶
The eval package provides evaluation utilities: a console application (eval), query-level difference computation and plotting (query_diff, plot_diff), a wrapper for trec_eval, and utility modules for working with TREC qrels and run files (trec_qrels, trec_run).
Console application for eval package.
Author: | Faegheh Hasibi |
---|
-
class
nordlys.core.eval.eval.
Eval
(operation, qrels=None, runs=None, metric=None, output_file=None)[source]¶ Bases:
object
Main entry point for eval package.
Parameters:
- operation – operation (see OPERATIONS for allowed values)
- qrels – name of qrels file
- runs – name of run files
- metric – metric

OPERATIONS = ['query_diff']
OP_QUERY_DIFF = 'query_diff'
Plots a series of scores which represent differences.
Authors: | Shuo Zhang, Krisztian Balog |
---|
-
class
nordlys.core.eval.plot_diff.
QueryDiff
[source]¶ Bases:
object
-
SCORES
= [25, 20, 10, 5, 0, -1, -5, -10]¶
-
Computes query-level differences between two runs.
Authors: | Shuo Zhang, Krisztian Balog, Dario Garigliotti |
---|
Wrapper for trec_eval.
Authors: | Dario Garigliotti, Shuo Zhang |
---|
-
class
nordlys.core.eval.trec_eval.
TrecEval
[source]¶ Bases:
object
Holds evaluation results obtained using trec_eval.
-
evaluate
(qrels_file, run_file, eval_file=None)[source]¶ Evaluates a runfile using trec_eval. Optionally writes evaluation output to file.
Parameters: - qrels_file – name of qrels file
- run_file – name of run file
- eval_file – name of evaluation output file
-
Utility module for working with TREC qrels files.
- Get statistics about a qrels file
trec_qrels <qrels_file> -o stat
- Filter qrels to contain only documents from a given set
trec_qrels <qrels_file> -o filter_docs -d <doc_ids_file> -f <output_file>
- Filter qrels to contain only queries from a given set
trec_qrels <qrels_file> -o filter_qs -q <query_ids_file> -f <output_file>
Author: | Krisztian Balog |
---|
-
class
nordlys.core.eval.trec_qrels.
TrecQrels
(file_name=None)[source]¶ Bases:
object
Represents relevance judgments (TREC qrels).
-
filter_by_doc_ids
(doc_ids_file, output_file)[source]¶ Filters qrels for a set of selected docIDs and outputs the results to a file.
Parameters: - doc_ids_file – File with one docID per line
- output_file – Output file name
-
filter_by_query_ids
(query_ids_file, output_file)[source]¶ Filters qrels for a set of selected queryIDs and outputs the results to a file.
Parameters: - query_ids_file – File with one queryID per line
- output_file – Output file name
-
get_rel
(query_id)[source]¶ Returns relevance level for a given query.
Parameters: query_id – queryID Returns: dict (docID as key and relevance as value) or None
-
Utility module for working with TREC runfiles.
- Get statistics about a runfile
trec_run <run_file> -o stat
- Filter runfile to contain only documents from a given set
trec_run <run_file> -o filter -d <doc_ids_file> -f <output_file> -n <num_results>
Authors: | Krisztian Balog, Dario Garigliotti |
---|
-
class
nordlys.core.eval.trec_run.
TrecRun
(file_name=None, normalize=False, remap_by_exp=False, run_id=None)[source]¶ Bases:
object
Represents a TREC runfile.
Parameters: - file_name – name of the run file
- normalize – whether retrieval scores are to be normalized for each query (default: False)
- remap_by_exp – whether scores are to be converted from the log-domain by taking their exp (default: False)
-
filter
(doc_ids_file, output_file, num_results=100)[source]¶ Filters runfile to include only selected docIDs and outputs the results to a file.
Parameters: - doc_ids_file – file with one doc_id per line
- output_file – output file name
- num_results – number of results per query
-
get_query_results
(query_id)[source]¶ Returns the corresponding RetrievalResults object for a given query.
Parameters: query_id – queryID Return type: nordlys.core.retrieval.retrieval_results.RetrievalResults
-
get_results
()[source]¶ Returns all results.
Returns: a dict with queryIDs as keys and RetrievalResults object as values
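A rough usage sketch of the TrecQrels, TrecRun, and TrecEval classes follows; the file names are placeholders, and trec_eval itself must be installed for TrecEval.evaluate to work.

from nordlys.core.eval.trec_qrels import TrecQrels
from nordlys.core.eval.trec_run import TrecRun
from nordlys.core.eval.trec_eval import TrecEval

# File names below are placeholders; point them to your own qrels/run files.
qrels = TrecQrels("data/qrels.txt")
run = TrecRun("data/run.txt", normalize=False)

# Relevance judgments for one query: {doc_id: relevance} or None.
print(qrels.get_rel("q1"))

# All run results: {query_id: RetrievalResults}.
results = run.get_results()

# Evaluate the run file with trec_eval; optionally write the output to a file.
TrecEval().evaluate("data/qrels.txt", "data/run.txt", eval_file="data/eval.txt")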
nordlys.core.ml package¶
The machine learning package is connected to Scikit-learn and can be used for learning-to-rank and classification purposes.
For information on how to use the package from the command line and how to set the configuration, see the ML command line usage section.
The file format of the training and test files is JSON. Each instance is presented as a dictionary, consisting of the following elements:
- ID: Instance id
- Target: The target value of the instance.
- Features: All the features, presented in key-value format. Note that all the instances should have the same set of features.
- Properties: All meta-data about the instance; e.g. query ID, or content.
Below is an excerpt from a json data file:
{
"0": {
"properties": {
"query": "papaqui soccer",
"entity": "<dbpedia:Soccer_(1985_video_game)>"
},
"target": "0",
"features": {
"feat1": 1,
"feat2": 0,
"feat3": 25
}
},
"1": {}
}
Note
The sample files for using the ML package are provided under data/ml_sample/
folder.
Note
Currently we provide support for Random Forest (RF) and Gradient Boosted Regression Trees (GBRT).
Cross-validation support.
We assume that instances (i) are uniquely identified by an instance ID and (ii) they have id and score properties. We access them using the Instances class.
Authors: | Faegheh Hasibi, Krisztian Balog |
---|
-
class
nordlys.core.ml.cross_validation.
CrossValidation
(k, instances, callback_train, callback_test)[source]¶ Bases:
object
- Class attributes:
- fold: dict of folds (1..k) with a dict
- {“training”: [list of instance_ids]}, {“testing”: [list of instance_ids]}
Parameters: - k – number of folds
- instances – Instances object
- callback_train – Callback function for training model
- callback_test – Callback function for applying model
-
create_folds
(group_by=None)[source]¶ Creates folds for the data set.
Parameters: group_by – property to group by (instance_id by default)
-
get_folds
(filename=None, group_by=None)[source]¶ Loads folds from file or generates them if the file doesn’t exist.
Parameters: - filename –
- k – number of folds
Returns:
Instance class.
- Features:
- This class supports different features for an instance.
- The features, together with the id of the instance, will be used in machine learning algorithms.
- All Features are stored in a dictionary, where keys are feature names (self.features).
- Instance properties:
- Properties are additional side information of an instance (e.g. query_id, entity_id, …).
- properties are stored in a dictionary (self.properties).
This is the base instance class. Specific types of instances can inherit from this class and add more properties to the base class.
Author: | Faegheh Hasibi |
---|
-
class
nordlys.core.ml.instance.
Instance
(id, features=None, target='0', properties=None)[source]¶ Bases:
object
- Class attributes:
- ins_id: (string) features: a dictionary of feature names and values target: (string) target id or class properties: a dictionary of property names and values
-
add_feature
(feature, value)[source]¶ Adds a new feature to the features.
Parameters: feature – (string), feature name :param value
-
add_property
(property, value)[source]¶ Adds a new property to the properties.
Parameters: property – (string), property name :param value
-
features
¶
-
classmethod
from_json
(ins_id, fields)[source]¶ Reads an instance in JSON format and generates Instance.
Parameters: - ins_id – instance id
- fields – A dictionary of fields
:return (ml.Instance)
-
get_property
(property)[source]¶ Returns the value of a given property.
:param property :return value
-
id
¶
-
properties
¶
-
to_json
(file_name=None)[source]¶ Converts instance to the JSON format.
Parameters: file_name – (string) :return JSON dump of the instance.
-
to_libsvm
(features, qid_prop=None)[source]¶ Converts instance to the Libsvm format. - RankLib format:
<target> qid:<qid> <feature>:<value> … # <info>
Example: 3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
NOTE: the property used for qid(qid_prop) should hold integers
Parameters: - features – the list of features that should be in the output
- qid_prop – property to be used as qid
:return str, instance in the rankLib format.
Instances used for Machine learning algorithms.
- Manages a set of Instance objects
- Loads instance-data from JSON or TSV files
- When using TSV, instance properties, target, and features are loaded from separate files
- Generates a list of instances in JSON or RankLib format
Authors: | Faegheh Hasibi, Krisztian Balog |
---|
-
class
nordlys.core.ml.instances.
Instances
(instances=None)[source]¶ Bases:
object
- Class attributes:
- instances: Instance objects stored in a dictionary indexed by instance id
Parameters: instances – instances in a list or dict - if list then list index is used as the instance ID - if dict then the key is used as the instance ID -
add_instance
(instance)[source]¶ Adds an Instance object to the list of instances.
Parameters: instance – Instance object
-
add_qids
(prop)[source]¶ Generates (integer) q_id-s (for libsvm) based on a given (non-integer) property. It assigns a unique integer value to each different value for that property.
Parameters: prop – name of the property. Returns:
-
append_instances
(ins_list)[source]¶ Appends the list of Instances objects.
Parameters: ins_list – list of Instance objects
-
classmethod
from_json
(json_file)[source]¶ Loads instances from a JSON file.
Parameters: json_file – (string) :return Instances object
-
get_instance
(instance_id)[source]¶ Returns an instance by instance id.
Parameters: instance_id – (string) Returns: Instance object
-
group_by_property
(property)[source]¶ Groups instances by a given property.
:param property :return a dictionary of instance ids {id:[ml.Instance, …], …}
-
to_json
(json_file=None)[source]¶ Converts all instances to JSON and writes it to the file
Parameters: json_file – (string) Returns: JSON dump of all instances.
-
to_libsvm
(file_name=None, qid_prop=None)[source]¶ Converts all instances to the LibSVM format and writes them to the file. - Libsvm format:
<line> .=. <target> qid:<qid> <feature>:<value> … # <info>
<target> .=. <float>
<qid> .=. <positive integer>
<feature> .=. <positive integer>
<value> .=. <float>
<info> .=. <string>
Example: 3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
- NOTES:
- The property used for qid(qid_prop) should hold integers
- For pointwise algorithms, we use instance id for qid
- Lines in the RankLib input have to be sorted by increasing qid.
Parameters: - file_name – File to write libsvm format of instances.
- qid_prop – property to be used as qid. If none,
-
to_str
(file_name=None)[source]¶ Converts instances to string and write them to the given file. :param file_name :return: String format of instances
-
to_treceval
(file_name, qid_prop='qid', docid_prop='en_id')[source]¶ Generates a TREC style run file - If there is an entity ranked more than once for the same query, the one with higher score is kept.
Parameters: - file_name – File to write TREC file
- qid_prop – Name of instance property to be used as query ID (1st column)
- docid_prop – Name of instance property to be used as document ID (3rd column)
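As a rough usage sketch (not part of the documented examples), an Instances collection following the MIFF structure above could be built and serialized as follows; identifiers and file names are placeholders.

from nordlys.core.ml.instance import Instance
from nordlys.core.ml.instances import Instances

# A toy instance following the MIFF structure described earlier.
ins = Instance("0",
               features={"feat1": 1, "feat2": 0, "feat3": 25},
               target="0",
               properties={"query": "papaqui soccer", "q_id": "1"})

data = Instances()
data.add_instance(ins)
data.to_json("instances.json")   # dump all instances in the JSON (MIFF) format

# Load the instances back and inspect one of them.
loaded = Instances.from_json("instances.json")
print(loaded.get_instance("0").features)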
The command-line application for general-purpose machine learning.
python -m nordlys.core.ml.ml <config_file>
- training_set: nordlys ML instance file format (MIFF)
- test_set: nordlys ML instance file format (MIFF); if provided then it’s always used for testing. Can be left empty if cross-validation is used, in which case the remaining split is used for testing.
- cross_validation:
- k: number of folds (default: 10); use -1 for leave-one-out
- split_strategy: name of a property (normally query-id for IR problems). If set, the entities with the same value for that property are kept in the same split. if not set, entities are randomly distributed among splits.
- splits_file: JSON file with splits (instance_ids); if the file is provided it is used, otherwise it’s generated
- create_splits: if True, creates the CV splits. Otherwise loads the splits from “split_file” parameter.
- model: ML model, currently supported values: rf, gbrt
- category: [regression | classification], default: “regression”
- parameters: dict with parameters of the given ML model
- If GBRT:
- alpha: learning rate, default: 0.1
- tree: number of trees, default: 1000
- depth: max depth of trees, default: 10% of number of features
- If RF:
- tree: number of trees, default: 1000
- maxfeat: max features of trees, default: 10% of number of features
- model_file: the model is saved to this file
- load_model: if True, loads the model
- feature_imp_file: Feature importance is saved to this file
- output_file: where output is written; default output format: TSV with instance_id and (estimated) target
{
"model": "gbrt",
"category": "regression",
"parameters":{
"alpha": 0.1,
"tree": 10,
"depth": 5
},
"training_set": "path/to/train.json",
"test_set": "path/to/test.json",
"model_file": "path/to/model.txt",
"output_file": "path/to/output.json",
"cross_validation":{
"create_splits": true,
"splits_file": "path/to/splits.json",
"k": 5,
"split_strategy": "q_id"
}
}
Authors: | Faegheh Hasibi, Krisztian Balog |
---|
-
class
nordlys.core.ml.ml.
ML
(config)[source]¶ Bases:
object
-
analyse_features
(model, feature_names)[source]¶ Ranks features based on their importance. Scikit uses Gini score to get feature importances.
Parameters: - model – trained model
- feature_names – list of feature names
-
apply_model
(instances, model)[source]¶ Applies model on a given set of instances.
Parameters: - instances – Instances object
- model – trained model
Returns: Instances
-
nordlys.core.retrieval package¶
The retrieval package provides basic indexing and scoring functionality based on Elasticsearch (v2.3). It can be used both for documents and for entities (as the latter are represented as fielded documents).
Indexing can be done by directly reading the content of documents.
The toy_indexer module provides a toy example. When the content of documents is stored in MongoDB (e.g., for DBpedia entities), use the indexer_mongo module for indexing. For further details on how this module can be used, see indexer_dbpedia.
For indexing DBpedia entities, we read the content of the entities from MongoDB and represent them as fielded documents in Elasticsearch.
- To speed up indexing, use add_docs_bulk(). The optimal number of documents to send in a single bulk depends on the size of the documents; you need to figure it out experimentally.
- We strongly recommend using the default Elasticsearch similarity (currently BM25) for indexing. (Other similarity functions may also be used; in that case, the similarity function can be updated after indexing.)
- Our default setting is not to store term positions in the index (for efficiency considerations).
Retrieval is done in two stages:
- First pass: the top N documents are retrieved using Elastic’s default search method
- Second pass: the (expensive) scoring of the top N documents is performed (implemented in Nordlys)
Nordlys currently supports the following models for second pass retrieval:
- Language modelling (LM) [1]
- Mixture of Language Models (MLM) [2]
- Probabilistic Model for Semistructured Data (PRMS) [3]
Check out the scorer module to get inspiration for implementing a new retrieval model.
- Always use an ElasticCache object (instead of Elastic) for getting stats from the index. This class stores index stats in memory, which highly benefits efficiency.
- We recommend creating a new ElasticCache object for each query. This way, you will make efficient use of your machine’s memory.
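A rough sketch of how index statistics might be read through ElasticCache is given below; the index name comes from the Elastic indices table, while the document ID and the assumption that term_freqs() returns a {term: frequency} dictionary are illustrative.

from nordlys.core.retrieval.elastic_cache import ElasticCache

# One ElasticCache object per query; statistics are cached in memory.
es = ElasticCache("dbpedia_2015_10")

doc_id = "<dbpedia:Total_Recall_(1990_film)>"   # assumed entity document ID
for term in ["total", "recall"]:
    cf = es.coll_term_freq(term, "catchall")             # collection term frequency
    tf = es.term_freqs(doc_id, "catchall").get(term, 0)  # assuming a {term: freq} dict
    print(term, cf, tf)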
[1] Jay M Ponte and W Bruce Croft. 1998. A Language modeling approach to information retrieval. In Proc. of SIGIR ‘98.
[2] Paul Ogilvie and Jamie Callan. 2003. Combining document representations for known-item search. Proc. of SIGIR ‘03.
[3] Jinyoung Kim, Xiaobing Xue, and W Bruce Croft. 2009. A probabilistic retrieval model for semistructured data. In Proc. of ECIR ‘09.
Utility class for working with Elasticsearch. This class is to be instantiated for each index.
To create an index, first you need to define field mappings and then build the index.
The sample code for creating an index is provided at nordlys.core.retrieval.toy_indexer.
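In the spirit of the toy indexer, a minimal sketch for building and querying a small index could look as follows; the index name, field names, and document contents are made up, and the assumption that documents are passed as {field: content} dictionaries is ours.

from nordlys.core.retrieval.elastic import Elastic

# Define field mappings, create the index, and add documents in bulk.
mappings = {
    "title": Elastic.analyzed_field(),
    "content": Elastic.analyzed_field(),
}
es = Elastic("toy_index")
es.create_index(mappings, force=True)

docs = {
    1: {"title": "Total Recall (1990)", "content": "total recall 1990 film"},
    2: {"title": "Total Recall (2012)", "content": "total recall 2012 film"},
}
es.add_docs_bulk(docs)

# First-pass retrieval using the index's similarity function (BM25 by default).
print(es.search("total recall", "content", num=10))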
The following statistics can be obtained from this class:
- Number of documents: Elastic.num_docs()
- Number of fields: Elastic.num_fields()
- Document count: Elastic.doc_count()
- Collection length: Elastic.coll_length()
- Average length: Elastic.avg_len()
- Document length: Elastic.doc_length()
- Document frequency: Elastic.doc_freq()
- Collection frequency: Elastic.coll_term_freq()
- Term frequencies: Elastic.term_freqs()

Notes:
- For efficiency reasons, we do not store term positions during indexing. To store them, see the corresponding mapping functions Elastic.analyzed_field() and Elastic.notanalyzed_searchable_field().
- Use ElasticCache for getting index statistics. This module caches the statistics in memory and boosts efficiency.
- Mind that ElasticCache does not empty the cache!
Authors: | Faegheh Hasibi, Krisztian Balog |
---|
-
class
nordlys.core.retrieval.elastic.
Elastic
(index_name)[source]¶ Bases:
object
-
ANALYZER_STOP
= 'stop_en'¶
-
ANALYZER_STOP_STEM
= 'english'¶
-
BM25
= 'BM25'¶
-
DOC_TYPE
= 'doc'¶
-
FIELD_CATCHALL
= 'catchall'¶
-
FIELD_ELASTIC_CATCHALL
= '_all'¶
-
SIMILARITY
= 'sim'¶
-
add_doc
(doc_id, contents)[source]¶ Adds a document with the specified contents to the index.
Parameters: - doc_id – document ID
- contents – content of document
-
add_docs_bulk
(docs)[source]¶ Adds a set of documents to the index in a bulk.
Parameters: docs – dictionary {doc_id: doc}
-
analyze_query
(query, analyzer='stop_en')[source]¶ Analyzes the query.
Parameters: - query – raw query
- analyzer – name of analyzer
-
static
analyzed_field
(analyzer='stop_en')[source]¶ Returns the mapping for analyzed fields.
For efficiency considerations, term positions are not stored. To store term positions, change
"term_vector": "with_positions_offsets"
Parameters: analyzer – name of the analyzer; valid options: [ANALYZER_STOP, ANALYZER_STOP_STEM]
-
coll_term_freq
(term, field, tv=None)[source]¶ Returns collection term frequency for the given field.
-
create_index
(mappings, model='BM25', model_params=None, force=False)[source]¶ Creates index (if it doesn’t exist).
Parameters: - mappings – field mappings
- model – name of elastic search similarity
- model_params – parameters of the similarity model
- force – forces index creation (overwrites if already exists)
-
get_doc
(doc_id, fields=None, source=True)[source]¶ Gets a document from the index based on its ID.
Parameters: - doc_id – document ID
- fields – list of fields to return (default: all)
- source – return document source as well (default: yes)
-
search
(query, field, num=100, fields_return='', start=0)[source]¶ Searches in a given field using the similarity method configured in the index for that field.
Parameters: - query – query string
- field – field to search in
- num – number of hits to return (default: 100)
- fields_return – additional document fields to be returned
- start – starting offset (default: 0)
Returns: dictionary of document IDs with scores
-
search_complex
(body, num=10, fields_return='', start=0)[source]¶ Supports complex structured queries, which are sent as a
body
field in Elasticsearch. For detailed information on formulating structured queries, see the official instructions. Below is an example of searching in two particular fields that each must contain a specific term.
Example:
# documents must match term_1 in the "title" field and the phrase term_2 in the "content" field
term_1 = "hello"
term_2 = "world"
body = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": term_1}},
                {"match_phrase": {"content": term_2}}
            ]
        }
    }
}
Parameters: - body – query body
- field – field to search in
- num – number of hits to return (default: 100)
- fields_return – additional document fields to be returned
- start – starting offset (default: 0)
Returns: dictionary of document IDs with scores
-
term_freqs
(doc_id, field, tv=None)[source]¶ Returns term frequencies of all terms for a given document and field.
-
update_similarity
(model='BM25', params=None)[source]¶ Updates the similarity function “sim”, which is fixed for all index fields.
The method and param should match elastic settings: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/index-modules-similarity.htmlParameters: - model – name of the elastic model
- params – dictionary of params based on elastic
-
This is a cache for elastic index stats; a layer between an index and retrieval. The statistics (such as document and term frequencies) are first read from the index and stay in the memory for further usages.
- Only one instance of ElasticCache needs to be created.
- If running out of memory, you need to create a new ElasticCache object.
- The class also caches term vectors. To further boost efficiency, you can load term vectors for multiple documents using ElasticCache.multi_termvector().
Author: | Faegheh Hasibi |
---|
-
class
nordlys.core.retrieval.elastic_cache.
ElasticCache
(index_name)[source]¶ Bases:
nordlys.core.retrieval.elastic.Elastic
-
coll_term_freq
(term, field, tv=None)[source]¶ Returns collection term frequency for the given field.
-
This class is a tool for creating an index from a Mongo collection.
To use this class, you need to implement the callback_get_doc_content() function. See indexer_fsdm for an example usage of this class.
Author: | Faegheh Hasibi |
---|
-
class
nordlys.core.retrieval.indexer_mongo.
IndexerMongo
(index_name, mappings, collection, model='BM25')[source]¶ Bases:
object
-
build
(callback_get_doc_content, bulk_size=1000)[source]¶ Builds the DBpedia index from the mongo collection.
To speed up indexing, we index documents in bulk. There is an optimal value for the bulk size; try to figure it out experimentally.
Parameters: - callback_get_doc_content – a function that gets a document from Mongo and returns the content for indexing
- bulk_size – Number of documents to be added to the index as a bulk
-
Console application for general-purpose retrieval.
python -m nordlys.services.er -c <config_file> -q <query>
If -q <query> is passed, it returns the results for the specified query and prints them in terminal.
- index_name: name of the index,
- first_pass:
- 1st_num_docs: number of documents in first-pass scoring (default: 100)
- field: field used in first pass retrieval (default: Elastic.FIELD_CATCHALL)
- fields_return: comma-separated list of fields to return for each hit (default: “”)
- num_docs: number of documents to return (default: 100)
- start: starting offset for ranked documents (default:0)
- model: name of retrieval model; accepted values: [lm, mlm, prms] (default: lm)
- field: field name for LM (default: catchall)
- fields: list of fields for PRMS (default: [catchall])
- field_weights: dictionary with fields and corresponding weights for MLM (default: {catchall: 1})
- smoothing_method: accepted values: [jm, dirichlet] (default: dirichlet)
- smoothing_param: value of lambda or mu; accepted values: [float or “avg_len”], (jm default: 0.1, dirichlet default: 2000)
- query_file: name of query file (JSON),
- output_file: name of output file,
- run_id: run id for TREC output
{"index_name": "dbpedia_2015_10",
"first_pass": {
"1st_num_docs": 1000
},
"model": "prms",
"num_docs": 1000,
"smoothing_method": "dirichlet",
"smoothing_param": 2000,
"fields": ["names", "categories", "attributes", "similar_entity_names", "related_entity_names"],
"query_file": "path/to/queries.json",
"output_file": "path/to/output.txt",
"run_id": "test"
}
Authors: | Krisztian Balog, Faegheh Hasibi |
---|
Result list representation.
- for each hit it holds score and both internal and external doc_ids
Authors: | Faegheh Hasibi, Krisztian Balog |
---|
Various retrieval models for scoring an individual document for a given query.
Authors: | Faegheh Hasibi, Krisztian Balog |
---|
-
class
nordlys.core.retrieval.scorer.
Scorer
(elastic, query, params)[source]¶ Bases:
object
Base scorer class.
-
SCORER_DEBUG
= 0¶
-
-
class
nordlys.core.retrieval.scorer.
ScorerLM
(elastic, query, params)[source]¶ Bases:
nordlys.core.retrieval.scorer.Scorer
Language Model (LM) scorer.
-
DIRICHLET
= 'dirichlet'¶
-
JM
= 'jm'¶
-
static
get_dirichlet_prob
(tf_t_d, len_d, tf_t_C, len_C, mu)[source]¶ Computes Dirichlet-smoothed probability. P(t|theta_d) = [tf(t, d) + mu P(t|C)] / [|d| + mu]
Parameters: Returns: Dirichlet-smoothed probability
-
static
get_jm_prob
(tf_t_d, len_d, tf_t_C, len_C, lambd)[source]¶ Computes JM-smoothed probability. p(t|theta_d) = [(1-lambda) tf(t, d)/|d|] + [lambda tf(t, C)/|C|]
Parameters: Returns: JM-smoothed probability
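A small worked example of these two static helpers (the term and collection statistics are made up for illustration):

from nordlys.core.retrieval.scorer import ScorerLM

# A 100-term document in which the term occurs 3 times, and a collection of
# 1,000,000 terms in which it occurs 2,000 times (made-up numbers).
tf_t_d, len_d = 3, 100
tf_t_C, len_C = 2000, 1000000

# Dirichlet: P(t|theta_d) = [tf(t,d) + mu * P(t|C)] / (|d| + mu), with mu = 2000
p_dir = ScorerLM.get_dirichlet_prob(tf_t_d, len_d, tf_t_C, len_C, 2000)

# Jelinek-Mercer: P(t|theta_d) = (1 - lambda) * tf(t,d)/|d| + lambda * tf(t,C)/|C|, with lambda = 0.1
p_jm = ScorerLM.get_jm_prob(tf_t_d, len_d, tf_t_C, len_C, 0.1)

print(p_dir, p_jm)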
-
get_lm_term_prob
(doc_id, field, t, tf_t_d_f=None, tf_t_C_f=None)[source]¶ Returns term probability for a document and field.
Parameters: - doc_id – document ID
- field – field name
- t – term
Returns: P(t|d_f)
-
-
class
nordlys.core.retrieval.scorer.
ScorerMLM
(elastic, query, params)[source]¶ Bases:
nordlys.core.retrieval.scorer.ScorerLM
Mixture of Language Model (MLM) scorer.
- Implemented based on:
- Ogilvie, Callan. Combining document representations for known-item search. SIGIR 2003.
-
get_mlm_term_prob
(doc_id, t)[source]¶ Returns MLM probability for the given term and field-weights. p(t|theta_d) = sum(mu_f * p(t|theta_d_f))
Parameters: - lucene_doc_id – internal Lucene document ID
- t – term
Returns: P(t|theta_d)
-
class
nordlys.core.retrieval.scorer.
ScorerPRMS
(elastic, query, params)[source]¶ Bases:
nordlys.core.retrieval.scorer.ScorerLM
PRMS scorer.
nordlys.core.storage package¶
The storage package provides tools for storing and parsing data: a MongoDB wrapper (used as, e.g., mongo = Mongo(host, db, collection)) and parsers for NTriples files and URI prefixes.
NTriples parser with URI prefixing
Author: | Krisztian Balog |
---|
-
class
nordlys.core.storage.parser.nt_parser.
Triple
(prefix=None)[source]¶ Bases:
object
Representation of a Triple to be used by the rdflib NTriplesParser.
-
class
nordlys.core.storage.parser.nt_parser.
TripleHandler
[source]¶ Bases:
object
This is an abstract class
-
class
nordlys.core.storage.parser.nt_parser.
TripleHandlerPrinter
[source]¶ Bases:
nordlys.core.storage.parser.nt_parser.TripleHandler
Example triple handler that only prints whatever it received.
URI prefixing.
Author: | Krisztian Balog |
---|
-
class
nordlys.core.storage.parser.uri_prefix.
URIPrefix
(prefix_file='/home/docs/checkouts/readthedocs.org/user_builds/nordlys/checkouts/latest/data/uri_prefix/prefixes.json')[source]¶ Bases:
object
-
nordlys.core.storage.parser.uri_prefix.
convert_txt_to_json
(txt_file, json_file='/home/docs/checkouts/readthedocs.org/user_builds/nordlys/checkouts/latest/data/uri_prefix/prefixes.json')[source]¶ Convert prefixes txt file to json.
This has to be done only once, and only if there is no .json file or the .txt file has changed.
Tools for working with MongoDB.
Authors: | Krisztian Balog, Faegheh Hasibi |
---|
-
class
nordlys.core.storage.mongo.
Mongo
(host, db, collection)[source]¶ Bases:
object
Manages the MongoDB connection and operations.
-
ID_FIELD
= '_id'¶
-
append_dict
(doc_id, field, dictkey, value)[source]¶ Appends the value to a given field that stores a dict. If the dictkey is already in use, the value stored there will be overwritten.
Parameters: - doc_id – document id
- field – field
- dictkey – key in the dictionary
- value – value to be stored under dictkey
-
append_list
(doc_id, field, value)[source]¶ Appends the value to a given field that stores a list. If the field does not exist yet, it will be created. The value should be a list.
Parameters: - doc_id – document id
- field – field
- value – list, a value to be appended to the current list
-
append_set
(doc_id, field, value)[source]¶ Adds a list of values to a set. If the field does not exist yet, it will be created. The value should be a list.
Parameters: - doc_id – document id
- field – field
- value – list, a value to be appended to the current list
-
find_all
(no_timeout=False)[source]¶ Returns a Cursor instance that allows us to iterate over all documents.
-
inc_in_dict
(doc_id, field, dictkey, value=1)[source]¶ Increments a value that is inside a dict.
Parameters: - doc_id – document id
- field – field
- dictkey – key in the dictionary
- value – value to be increased by
-
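A rough usage sketch of this class is given below; the connection settings and document/field names are made up (in Nordlys they normally come from config/mongo.json).

from nordlys.core.storage.mongo import Mongo

# Connection settings are illustrative placeholders.
mongo = Mongo("localhost", "nordlys-test", "toy_collection")

doc_id = "doc1"
mongo.append_list(doc_id, "labels", ["Total Recall", "Total Recall (1990)"])
mongo.append_dict(doc_id, "counts", "total recall", 42)
mongo.inc_in_dict(doc_id, "counts", "total recall", value=1)

# Iterate over all documents in the collection.
for doc in mongo.find_all():
    print(doc[Mongo.ID_FIELD])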
nordlys.core.utils package¶
@author: Krisztian Balog
Utility methods for file handling.
Authors: | Krisztian Balog, Faegheh Hasibi |
---|
-
class
nordlys.core.utils.file_utils.
FileUtils
[source]¶ Bases:
object
-
static
dump_tsv
(file_name, data, header=None, append=False)[source]¶ Dumps the data in tsv format.
Parameters: - file_name – name of file
- data – list of list
- header – list of headers
- append – if True, appends the data to the existing file
-
static
load_config
(config)[source]¶ Loads config file/dictionary.
Parameters: config – json file or a dictionary Returns: config dictionary
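As a brief illustrative sketch (file names and data are placeholders):

from nordlys.core.utils.file_utils import FileUtils

# Load a config file (accepts a JSON file path or an already-parsed dictionary).
config = FileUtils.load_config("path/to/config.json")

# Dump some rows in TSV format, with a header line.
rows = [["q1", "<dbpedia:Total_Recall_(1990_film)>", -10.04],
        ["q1", "<dbpedia:Total_Recall_(2012_film)>", -10.30]]
FileUtils.dump_tsv("results.tsv", rows, header=["query_id", "entity", "score"])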
nordlys.logic package¶
Services¶
All modules in this package serve the services layer.
Subpackages¶
nordlys.logic.el package¶
This package is the implementation of entity linking.
Class for commonness entity linking approach
Author: | Faegheh Hasibi |
---|
Utility methods for entity linking.
Author: Faegheh Hasibi
-
nordlys.logic.el.el_utils.
is_name_entity
(en_id)[source]¶ Returns true if the entity is considered as proper name entity.
Generative model for interpretation set finding
@author: Faegheh Hasibi
-
class
nordlys.logic.el.greedy.
Greedy
(score_th)[source]¶ Bases:
object
-
create_interpretations
(query_inss)[source]¶ Groups CER instances as interpretation sets.
Return list of interpretations, where each interpretation is a dictionary {mention: (en_id, score), ..}
-
disambiguate
(inss)[source]¶ Takes instances and generates set of entity linking interpretations.
Parameters: inss – Instances object Returns: sets of interpretations [{mention: (en_id, score), ..}, …]
-
is_overlapping
(mentions)[source]¶ Checks whether the strings of a set are overlapping or not, i.e., whether there exists a term that appears twice in the whole set.
- E.g. {“the”, “music man”} is not overlapping
- {“the”, “the man”, “music”} is overlapping.
NOTE: If a query is “yxxz” the mentions {“yx”, “xz”} and {“yx”, “x”} are overlapping.
Parameters: mentions – A list of strings :return True/False
-
Class for Learning-to-Rank entity linking approach
Author: | Faegheh Hasibi |
---|
-
class
nordlys.logic.el.ltr.
LTR
(query, entity, elastic, fcache, model=None, threshold=None, cmns_th=0.1)[source]¶ Bases:
object
-
get_candidate_inss
()[source]¶ Detects mentions and their candidate entities (with their commonness scores) and generates instances.
Returns: Instances object
-
get_features
(ins, cand_ens=None)[source]¶ Generates the features set for each instance.
Parameters: - ins – instance object
- cand_ens – dictionary of candidate entities {en_id: cmns, …}
Returns: dictionary of features {ftr_name: value, …}
-
link
()[source]¶ Links the query to the entity.
Returns: dictionary [{“mention”: xx, “entity”: yy, “score”: zz}, …]
-
static
load_yerd
(gt_file)[source]¶ Reads the Y-ERD collection and returns a dictionary.
Parameters: gt_file – Path to the Y-ERD collection Returns: dictionary {(qid, query, en_id, mention) …}
-
nordlys.logic.elr package¶
This package is the implementation of ELR-based models.
nordlys.logic.entity package¶
This is the entity package.
Provides access to entity catalogs (DBpedia and surface forms).
Author: | Faegheh Hasibi |
---|
-
class
nordlys.logic.entity.entity.
Entity
[source]¶ Bases:
object
-
lookup_en
(entity_id)[source]¶ Looks up an entity by its identifier.
Parameters: entity_id – entity identifier (“<dbpedia:Audi_A4>”) :return A dictionary with the entity document or None.
-
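A minimal usage sketch, assuming the MongoDB collections from the “Loading data components” step are in place (the label key shown is an assumption about the stored entity document):

from nordlys.logic.entity.entity import Entity

en = Entity()
doc = en.lookup_en("<dbpedia:Audi_A4>")   # returns the entity document or None
if doc is not None:
    print(doc.get("<rdfs:label>"))        # assumed key; inspect doc.keys() for the actual fields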
nordlys.logic.er package¶
This is the entity retrieval package.
nordlys.logic.features package¶
This is the features package.
Implements features related to an entity-mention pair.
Author: | Faegheh Hasibi |
---|
Implements features capturing the similarity between entity and a query.
Author: | Faegheh Hasibi |
---|
-
class
nordlys.logic.features.ftr_entity_similarity.
FtrEntitySimilarity
(query, en_id, elastic)[source]¶ Bases:
object
-
DEBUG
= 0¶
-
context_sim
(mention, field='catchall')[source]¶ LM score of the entity to the context of the query (context means query minus mention). E.g., given the query “uss yorktown charleston” and mention “uss”, the query context is “yorktown charleston”.
Parameters: - mention – string
- field – field name
:return context similarity score
-
lm_score
(field='catchall')[source]¶ Query length normalized LM score between entity field and query
Parameters: field – field name :return LM score
-
mlm_score
(field_weights)[source]¶ Query length normalized MLM similarity between the entity and query
Parameters: field_weights – dictionary {field: weight, …} :return MLM score
-
nllr
(query, field_weights)[source]¶ Computes Normalized query likelihood (NLLR):
NLLR(q,d) = sum_{t in q} P(t|q) log P(t|theta_d) - sum_{t in q} P(t|q) log P(t|C)
where:
P(t|q) = n(t,q)/|q|
P(t|C) = sum_{f} mu_f * P(t|C_f)
P(t|theta_d) = smoothed LM/MLM score
Parameters: - query – query
- field_weights – dictionary {field: weight, …}
Returns: NLLR score
-
Implements functionalities over the 300-dim GoogleNews word2vec semantic representations of words.
Author: | Dario Garigliotti |
---|
nordlys.logic.fusion package¶
Abstract class for fusion-based scoring.
Authors: | Shuo Zhang, Krisztian Balog, Dario Garigliotti |
---|
-
class
nordlys.logic.fusion.fusion_scorer.
FusionScorer
(index_name, association_file=None, run_id='fusion')[source]¶ Bases:
object
Parameters: - index_name – name of index
- association_file – association file
-
ASSOC_MODE_BINARY
= 1¶
-
ASSOC_MODE_UNIFORM
= 2¶ Abstract class for any fusion-based method.
-
load_queries
(query_file)[source]¶ Loads the query file :return: query dictionary {queryID:query([term1,term2,…])}
Class for late fusion scorer (i.e., document-centric model).
Authors: | Shuo Zhang, Krisztian Balog, Dario Garigliotti |
---|
-
class
nordlys.logic.fusion.late_fusion_scorer.
LateFusionScorer
(index_name, retr_model, retr_params, num_docs=None, field='content', run_id='fusion', num_objs=100, assoc_mode=1, assoc_file=None)[source]¶ Bases:
nordlys.logic.fusion.fusion_scorer.FusionScorer
Parameters: - index_name – name of index
- assoc_file – document-object association file
- assoc_mode – document-object weight mode, uniform or binary
- retr_model – the retrieval model; valid values: “lm”, “bm25”
- retr_params – config including smoothing method and parameter
- num_objs – the number of ranked objects for a query
- assoc_mode – the fusion weights, which could be binary or uniform
- assoc_file – object-doc association file
nordlys.logic.query package¶
This is the query package.
nordlys.services package¶
Services¶
All modules in this package can be used from the command line or from a RESTful API.
Submodules¶
nordlys.services.api module¶
This is the main console application for the Nordlys API.
Authors: | Krisztian Balog, Faegheh Hasibi, Shuo Zhang |
---|
nordlys.services.ec module¶
Command line end point for entity catalog
python -m nordlys.services.ec -o <operation> -i <input>
- python -m nordlys.services.ec -o lookup_id -i <dbpedia:Audi_A4>
- python -m nordlys.services.ec -o “lookup_sf_dbpedia” -i “audi a4”
- python -m nordlys.services.ec -o “lookup_sf_facc” -i “audi a4”
- python -m nordlys.services.ec -o “dbpedia2freebase” -i “<dbpedia:Audi_A4>”
- python -m nordlys.services.ec -o “freebase2dbpedia” -i “<fb:m.030qmx>”
Author: | Faegheh Hasibi |
---|
nordlys.services.el module¶
The command-line application for entity linking
python -m nordlys.services.el -c <config_file> -q <query>
If -q <query> is passed, it returns the results for the specified query and prints them in terminal.
- method: name of the method
- cmns The baseline method that uses the overall popularity of entities as link targets
- ltr The learning-to-rank model
- threshold: Entity linking threshold; varies depending on the method (default: 0.1)
- step: The step of entity linking process: [linking|ranking|disambiguation], (default: linking)
- kb_snapshot: File containing the KB snapshot of proper named entities; required for LTR, and optional for CMNS
- query_file: name of query file (JSON)
- output_file: name of output file
Parameters of LTR method:
- model_file: The trained model file; (default:”data/el/model.txt”)
- ground_truth: The ground truth file; (optional)
- gen_training_set: If True, generates the training set from the groundtruth and query files; (default: False)
- gen_model: If True, trains the model from the training set; (default: False)
- The other parameters are similar to the nordlys.core.ml.ml settings
nordlys.services.er module¶
Command-line application for entity retrieval.
python -m nordlys.services.er -c <config_file> -q <query>
If -q <query> is passed, it returns the results for the specified query and prints them in terminal.
- index_name: name of the index,
- first_pass:
- num_docs: number of documents in first-pass scoring (default: 100)
- field: field used in first pass retrieval (default: Elastic.FIELD_CATCHALL)
- fields_return: comma-separated list of fields to return for each hit (default: “”)
- num_docs: number of documents to return (default: 100)
- start: starting offset for ranked documents (default:0)
- model: name of retrieval model; accepted values: [lm, mlm, prms] (default: lm)
- field: field name for LM (default: catchall)
- fields: list of fields for PRMS (default: [catchall])
- field_weights: dictionary with fields and corresponding weights for MLM (default: {catchall: 1})
- smoothing_method: accepted values: [jm, dirichlet] (default: dirichlet)
- smoothing_param: value of lambda or mu; accepted values: [float or “avg_len”], (jm default: 0.1, dirichlet default: 2000)
- query_file: name of query file (JSON),
- output_file: name of output file,
- run_id: run id for TREC output
{"index_name": "dbpedia_2015_10",
"first_pass": {
"num_docs": 1000
},
"model": "prms",
"num_docs": 1000,
"smoothing_method": "dirichlet",
"smoothing_param": 2000,
"fields": ["names", "categories", "attributes", "similar_entity_names", "related_entity_names"],
"query_file": "path/to/queries.json",
"output_file": "path/to/output.txt",
"run_id": "test"
}
Author: | Faegheh Hasibi |
---|
nordlys.services.tti module¶
The command-line application for target type identification.
python -m nordlys.services.tti <config_file> -q <query>
If -q <query> is passed, it returns the results for the specified query and prints them in terminal.
- method: name of TTI method; accepted values: [“tc”, “ec”, “ltr”]
- num_docs: number of documents to return
- start: starting offset for ranked documents
- model: retrieval model, if method is “tc” or “ec”; accepted values: [“lm”, “bm25”]
- ec_cutoff: if method is “ec”, rank cut-off of top-K entities for EC TTI
- field: field name, if method is “tc” or “ec”
- smoothing_method: accepted values: [“jm”, “dirichlet”]
- smoothing_param: value of lambda or mu; accepted values: [float or “avg_len”]
- query_file: path to query file (JSON)
- output_file: path to output file (JSON)
- trec_output_file: path to output file (trec_eval-formatted)
{ "method": "ec",
"num_docs": 10,
"model": "lm",
"first_pass": {
"num_docs": 50
},
"smoothing_method": "dirichlet",
"smoothing_param": 2000,
"ec_cutoff": 20,
"query_file": "path/to/queries.json",
"output_file": "path/to/output.txt",
}
Author: | Dario Garigliotti |
---|
Contact¶
Nordlys is written by:
For any queries contact <faegheh.hasibi@ntnu.no> or <krisztian.balog@uis.no>.
Drop us a line! We are looking forward to hearing your feedback!