
Entity-resolution-SIGMOD-2020

📷🎥 Entity resolution system for SIGMOD 2020 programming contest


Entity Resolution is the problem of identifying and matching different manifestations of the same real-world object in a dataset. Ironically, Entity Resolution has many duplicate names in the literature, including Record Linkage, Deduplication, Approximate Match, Entity Clustering, and so on.

For this task we needed to identify which product specifications (in short, specs) from multiple e-commerce websites represent the same real-world product. All specs refer to cameras and include information about the camera model (e.g. Canon EOS 5D Mark II) and, possibly, accessories (e.g. lens kit, bag, tripod). The challenge is to develop an Entity Resolution system for matching the specs with the same camera models with high precision and recall.

This system was developed as part of a course in our bachelor's degree during the winter semester of 2020-2021.

Contest website: ACM SIGMOD Programming Contest 2020

System functionality - Abstract

At the beginning of the program, the folder containing the specs data is read and its information is stored in the appropriate structures. Then Dataset W is read, and cliques of matching specs as well as negative correlations are formed. Next, the vocabulary is created and a vector is computed for each camera. A dataset of mixed positive and negative correlations is then formed and split into three parts: train 60%, test 20%, validation 20%. In each iteration the model is trained with the Mini-Batch GD algorithm for logistic regression using multi-threading. Training is evaluated on the test set, and the model is then retrained on all pairs of the original dataset for which no correlation is known. At the end of the iterations, a final estimate of the model's performance is obtained on the validation set.
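As a small, concrete illustration of one of the steps above, the 60%/20%/20% split of the labelled pairs could look roughly like this (a minimal sketch with a hypothetical Pair record; the repository's actual structures and functions differ):

#include <stdlib.h>

typedef struct { int left_id, right_id, label; } Pair;     /* hypothetical pair record */

/* Shuffle the labelled pairs and split them into 60% train, 20% test, 20% validation. */
void split_dataset(Pair *pairs, size_t n, size_t *train_n, size_t *test_n, size_t *val_n) {
    for (size_t i = n; i > 1; i--) {                        /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % i;
        Pair tmp = pairs[i - 1]; pairs[i - 1] = pairs[j]; pairs[j] = tmp;
    }
    *train_n = (size_t)(0.6 * n);                           /* first 60%  -> training   */
    *test_n  = (size_t)(0.2 * n);                           /* next 20%   -> test       */
    *val_n   = n - *train_n - *test_n;                      /* remainder  -> validation */
}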

System task

Our goal was to find all pairs of product specs in dataset X that match, i.e. refer to the same real-world product. The output is stored in a CSV file containing only the matching spec pairs found by our system.

Example of output CSV file

left_spec_id, right_spec_id
www.ebay.com//10, www.ebay.com//20
www.ebay.com//10, buy.net//100

Compile & Execution

In the initial directory type

Alternatively

For Dataset-W (sigmod_large_labelled_dataset.csv) in the initial directory:

cd programs/EntityResolution
./entityResolutionPairs -jd ./../../data/camera_specs -csv ./../../data/sigmod_large_labelled_dataset.csv

Or, for Dataset-Y (sigmod_medium_labelled_dataset.csv) in the initial directory:

cd programs/EntityResolution
./entityResolutionPairs -jd ./../../data/camera_specs -csv ./../../data/sigmod_medium_labelled_dataset.csv

Command line execution with given parameters

For example, the command for the large dataset is:

./entityResolutionPairs -jd ./../../data/camera_specs 
                        -csv ./../../data/sigmod_large_labelled_dataset.csv 
                        -lr learning_rate 
                        -sv step_value 
                        -et euclid_threshold 
                        -rt retrain_threshold 
                        -ep num_of_epochs 
                        -rl retrain_loops 
                        -bs batch_size 
                        -nt number_of_threads

Available options (as shown by the placeholders above):

- -jd : path to the directory containing the JSON specs
- -csv : path to the labelled dataset CSV (Dataset W or Y)
- -lr : learning rate
- -sv : step value
- -et : Euclidean threshold
- -rt : retrain threshold
- -ep : number of epochs
- -rl : number of retrain loops
- -bs : batch size
- -nt : number of threads

Input data

We are provided with a dataset including ~30k specs in JSON format, each spec containing a list of (attribute_name, attribute_value) pairs extracted from a different web page, collected across 24 different web sources. We will refer to this dataset as dataset X.

Each spec is stored as a file, and files are organized into directories, each directory corresponding to a different web source (e.g., www.alibaba.com). All specs refer to cameras and include information about the camera model (e.g. Canon EOS 5D Mark II) and, possibly, accessories (e.g. lens kit, bag, tripod). Accessories do not contribute to product identification: for instance, a Canon EOS 5D Mark II that is sold as a bundle with a bag represents the same core product as a Canon EOS 5D Mark II that is sold alone.

Example of product specification in JSON format

{
  "<page title>": "Samsung Smart WB50F Digital Camera White Price in India with Offers & Full Specifications | PriceDekho.com",
  "brand": "Samsung",
  "dimension": "101 x 68 x 27.1 mm",
  "display": "LCD 3 Inches",
  "pixels": "Optical Sensor Resolution (in MegaPixel)\n16.2 MP"
  "battery": "Li-Ion"
}

We are also provided with a labelled dataset in CSV format, containing three columns: “left_spec_id”, “right_spec_id” and “label”. We will refer to this dataset as dataset W (which includes the previously released labelled dataset, referred to as dataset Y).

Example of labelled dataset in CSV format

left_spec_id, right_spec_id, label
www.ebay.com//1, www.ebay.com//2, 1
www.ebay.com//3, buy.net//10, 0

Data structures and algorithms used

Unit testing

To test the data structures as well as the models we created, we used the acutest.h library.

More information about the acutest library
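For illustration, a unit test with acutest looks roughly like the following (a hypothetical example, not one of the project's actual tests):

#include "acutest.h"

/* acutest supplies main(); each test is a void function registered in TEST_LIST. */
void test_addition(void) {
    TEST_CHECK(1 + 1 == 2);
    TEST_CHECK_(2 * 2 == 4, "2 * 2 should equal %d", 4);
}

TEST_LIST = {
    { "addition", test_addition },
    { NULL, NULL }                 /* the list must be NULL-terminated */
};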

Multi-threading implementation

To speed up execution, the program has been parallelized at two points. Before describing those points, we give a concise explanation of the multi-threading implementation. The relevant code is in the folder /modules/JobScheduler and consists of:

We created two job schedulers: one that parallelizes the computation of the gradients per batch (this function is explained in the next section) and one that creates, in parallel, the pairs used for retraining.
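A minimal sketch of such a scheduler, assuming a plain pthread-based job queue (the actual interface of /modules/JobScheduler may differ):

#include <pthread.h>
#include <stdlib.h>

typedef struct Job { void (*run)(void *); void *arg; struct Job *next; } Job;

typedef struct {
    Job *head, *tail;
    int stop;
    pthread_mutex_t mtx;
    pthread_cond_t cond;
    pthread_t *workers;
    int nthreads;
} JobScheduler;

static void *worker(void *p) {
    JobScheduler *js = p;
    for (;;) {
        pthread_mutex_lock(&js->mtx);
        while (js->head == NULL && !js->stop)
            pthread_cond_wait(&js->cond, &js->mtx);        /* sleep until work arrives */
        if (js->head == NULL && js->stop) {                /* queue drained, shutting down */
            pthread_mutex_unlock(&js->mtx);
            return NULL;
        }
        Job *job = js->head;                               /* pop the next job */
        js->head = job->next;
        if (js->head == NULL) js->tail = NULL;
        pthread_mutex_unlock(&js->mtx);
        job->run(job->arg);                                /* execute outside the lock */
        free(job);
    }
}

JobScheduler *js_create(int nthreads) {
    JobScheduler *js = calloc(1, sizeof *js);
    pthread_mutex_init(&js->mtx, NULL);
    pthread_cond_init(&js->cond, NULL);
    js->nthreads = nthreads;
    js->workers = malloc(nthreads * sizeof *js->workers);
    for (int i = 0; i < nthreads; i++)                     /* threads are created once and reused */
        pthread_create(&js->workers[i], NULL, worker, js);
    return js;
}

void js_submit(JobScheduler *js, void (*run)(void *), void *arg) {
    Job *job = malloc(sizeof *job);
    job->run = run; job->arg = arg; job->next = NULL;
    pthread_mutex_lock(&js->mtx);
    if (js->tail) js->tail->next = job; else js->head = job;
    js->tail = job;
    pthread_cond_signal(&js->cond);                        /* wake one sleeping worker */
    pthread_mutex_unlock(&js->mtx);
}

void js_shutdown(JobScheduler *js) {                       /* finish queued jobs, then join */
    pthread_mutex_lock(&js->mtx);
    js->stop = 1;
    pthread_cond_broadcast(&js->cond);
    pthread_mutex_unlock(&js->mtx);
    for (int i = 0; i < js->nthreads; i++)
        pthread_join(js->workers[i], NULL);
    pthread_mutex_destroy(&js->mtx);
    pthread_cond_destroy(&js->cond);
    free(js->workers);
    free(js);
}

Threads are created once at start-up and then keep taking jobs from the queue, which is why their number stays bounded (see the benefits below).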

Benefits of this approach:

  1. The number of threads stays bounded, since threads are reused rather than repeatedly created, and we take full advantage of the speed-up they offer.
  2. Independent tasks, such as those mentioned above, are executed in parallel, resulting in faster execution.

Machine learning model - Logistic Regression

Gradient descent algorithms - GD

We tried three versions of this model:

Full-Batch GD

In this implementation, every computation of the derivatives, and of the gradient in general, required access to the entire dataset. For this approach we observed that:

Stochastic GD

This algorithm was added in the second version of the system. In short, the model takes the data vectors one at a time; for each one it computes the loss and the gradient and immediately updates the stored weights. For this algorithm we observe the following:
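For reference, a single per-example update of logistic regression could be sketched as follows (hypothetical names; not the project's actual code):

#include <math.h>

/* Sigmoid of the linear score w.x + b */
static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

/* One stochastic update: predict, compute the error, and move the weights
 * against the gradient of the logistic loss for this single example.      */
static void sgd_step(double *w, double *b, const double *x, int dim,
                     int label, double learning_rate) {
    double z = *b;
    for (int i = 0; i < dim; i++) z += w[i] * x[i];
    double error = sigmoid(z) - (double)label;        /* d(loss)/dz */
    for (int i = 0; i < dim; i++)
        w[i] -= learning_rate * error * x[i];
    *b -= learning_rate * error;
}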

Mini-Batch GD

Mini-Batch GD is our final algorithm, and it is the one used in the final implementation of our system. In short, this algorithm is a hybrid of the previous two, as it updates the model weights once per batch of data. Batch sizes of 512, 1024 and 2056 elements were used in our experiments. This algorithm gave us the best results, and for it we make the following observations:
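A corresponding sketch of one mini-batch update, where the gradient is accumulated over the whole batch before a single averaged weight update (again with hypothetical names):

#include <math.h>
#include <stdlib.h>

static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

/* One mini-batch update: accumulate the gradient over `batch_size`
 * examples, then apply a single averaged weight update.              */
static void minibatch_step(double *w, double *b,
                           const double *const *x, const int *y,
                           int batch_size, int dim, double learning_rate) {
    double *grad = calloc(dim, sizeof *grad);
    double grad_b = 0.0;
    for (int n = 0; n < batch_size; n++) {
        double z = *b;
        for (int i = 0; i < dim; i++) z += w[i] * x[n][i];
        double error = sigmoid(z) - (double)y[n];
        for (int i = 0; i < dim; i++) grad[i] += error * x[n][i];
        grad_b += error;
    }
    for (int i = 0; i < dim; i++)                          /* averaged update */
        w[i] -= learning_rate * grad[i] / batch_size;
    *b -= learning_rate * grad_b / batch_size;
    free(grad);
}

In the actual system, this per-batch gradient computation is the part that is handed to the job scheduler, so the sums over a batch are computed by multiple threads.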

Re-training on strong probabilities

The model is retrained in a loop that runs either for a fixed number of iterations or until specific threshold values are reached.
First, the model is trained with the Mini-Batch GD algorithm for logistic regression using multi-threading. Then, again using multi-threading, all pairs for which no positive or negative correlation had been determined are passed through the model. If a sufficiently strong positive or negative probability arises, the new pair is stored in a table. This table is then sorted with quicksort, from the highest probability to the lowest. Through the resolve-transitivity-issues function, some of these pairs are integrated into the training set.
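The selection and ordering of strongly predicted pairs could be sketched as follows (the exact thresholding and the resolve-transitivity-issues step are simplified here; names and thresholds are hypothetical):

#include <stdlib.h>

typedef struct { int left, right; double prob; } ScoredPair;   /* hypothetical pair record */

/* Sort candidates from highest to lowest predicted probability. */
static int by_prob_desc(const void *a, const void *b) {
    double pa = ((const ScoredPair *)a)->prob, pb = ((const ScoredPair *)b)->prob;
    return (pa < pb) - (pa > pb);
}

/* Keep only pairs the model is very confident about, in either direction,
 * and order them so the strongest predictions are considered first.      */
static size_t select_confident(ScoredPair *cand, size_t n, double retrain_threshold) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (cand[i].prob >= 1.0 - retrain_threshold ||     /* strong positive */
            cand[i].prob <= retrain_threshold)             /* strong negative */
            cand[kept++] = cand[i];
    }
    qsort(cand, kept, sizeof *cand, by_prob_desc);
    return kept;                                           /* number of pairs kept */
}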

Model performance

Metrics: Accuracy, Precision, Recall, F1-Score
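For reference, these metrics are computed from the confusion-matrix counts as follows:

/* tp, fp, tn, fn: true/false positives/negatives counted on a labelled set. */
double accuracy (long tp, long fp, long tn, long fn) { return (double)(tp + tn) / (tp + fp + tn + fn); }
double precision(long tp, long fp)                   { return tp + fp ? (double)tp / (tp + fp) : 0.0; }
double recall   (long tp, long fn)                   { return tp + fn ? (double)tp / (tp + fn) : 0.0; }
double f1_score (double p, double r)                 { return p + r > 0 ? 2.0 * p * r / (p + r) : 0.0; }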

Best results for medium labelled dataset without re-train

| Learning rate | Euclid threshold | Epochs | Batch size | Threads | Test Accuracy | Test Recall | Test Precision | Test F1 | Validation Accuracy | Validation Recall | Validation Precision | Validation F1 | CPU Time | Real Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.001 | 0.00010 | 50 | 2056 | 10 | 92.69 | 8.50 | 70.93 | 15.17 | 92.74 | 8.36 | 75.00 | 15.04 | 1.41 | 1.52 |
| 0.001 | 0.00001 | 50 | 2056 | 10 | 92.69 | 8.50 | 70.93 | 15.17 | 92.74 | 8.36 | 75.00 | 15.04 | 1.41 | 1.54 |
| 0.001 | 0.00100 | 50 | 2056 | 10 | 92.69 | 8.50 | 70.93 | 15.17 | 92.74 | 8.36 | 75.00 | 15.04 | 1.39 | 1.53 |
| 0.100 | 0.10000 | 50 | 2056 | 10 | 92.45 | 16.02 | 53.00 | 24.60 | 92.68 | 17.41 | 58.14 | 26.80 | 1.40 | 1.50 |
| 0.100 | 0.00100 | 50 | 2056 | 10 | 92.45 | 16.02 | 53.00 | 24.60 | 92.68 | 17.41 | 58.14 | 26.80 | 1.35 | 1.44 |

Best results for large labelled dataset without re-train

| Learning rate | Euclid threshold | Epochs | Batch size | Threads | Test Accuracy | Test Recall | Test Precision | Test F1 | Validation Accuracy | Validation Recall | Validation Precision | Validation F1 | CPU Time | Real Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.001 | 0.00001 | 50 | 512 | 20 | 88.52 | 12.06 | 75.64 | 20.81 | 88.61 | 12.60 | 77.26 | 21.66 | 9.85 | 10.48 |
| 0.001 | 0.00010 | 50 | 512 | 20 | 88.52 | 12.06 | 75.64 | 20.81 | 88.61 | 12.60 | 77.26 | 21.66 | 9.86 | 13.65 |
| 0.001 | 0.00100 | 50 | 512 | 20 | 88.52 | 12.06 | 75.64 | 20.81 | 88.61 | 12.60 | 77.26 | 21.66 | 9.86 | 10.50 |
| 0.001 | 0.10000 | 50 | 512 | 20 | 88.49 | 11.50 | 76.44 | 19.99 | 88.56 | 11.97 | 77.68 | 20.74 | 9.24 | 9.85 |
| 0.010 | 0.00001 | 50 | 512 | 10 | 88.42 | 9.79 | 80.02 | 17.45 | 88.44 | 9.99 | 80.34 | 17.77 | 10.00 | 10.62 |

Best results for large labelled dataset with re-train

| Learning rate | Euclid threshold | Retrain threshold | Epochs | Batch size | Threads | Validation Accuracy | Validation Recall | Validation Precision | Validation F1 | CPU Time | Real Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.00100 | 0.0001 | 0.020000 | 50 | 2056 | 20 | 92.80 | 14.62 | 64.02 | 23.81 | 5.61 | 10.20 |
| 0.001000 | 0.000100 | 0.020000 | 50 | 1024 | 20 | 92.37 | 3.34 | 57.14 | 6.32 | 13.45 | 18.61 |
| 0.001000 | 0.000100 | 0.010000 | 50 | 1024 | 20 | 92.37 | 3.34 | 57.14 | 6.32 | 13.57 | 18.81 |
| 0.10000 | 0.000 | 0.020 | 5 | 1024 | 20 | 92.31 | 0.00 | 0.00 | 0.00 | 1.85 | 5.33 |

Scores per retrain epoch

Performance remarks for medium dataset

Remarks based on the diagrams

Comments based on the scoreboards:

Alternative implementation

In the second part of the implementation of the system, we were asked to implement the classification step in an alternative way. More specifically:

Implementation techniques for increasing system performance

At the various stages of the project, we observed that the following design choices improved the program's execution time without straining memory:

Shell script - Bash

We implemented a shell script that runs the system with multiple parameter combinations; we used it to perform a grid search.
Execution:

cd ./scripts
./gridsearch.sh

Python notebook

We created a notebook in order to interpret the system's results.

Environment


© Myrto Iglezou && Konstantinos Nikoletos