+
+# SPOTER Embeddings
+
+This repository contains code for the SPOTER embedding model.
+
+The model is heavily based on [SPOTER](https://github.com/matyasbohacek/spoter), presented in
+[Sign Pose-Based Transformer for Word-Level Sign Language Recognition](https://openaccess.thecvf.com/content/WACV2022W/HADCV/html/Bohacek_Sign_Pose-Based_Transformer_for_Word-Level_Sign_Language_Recognition_WACVW_2022_paper.html).
+The main difference is that this is an embedding model rather than a classification model,
+which enables several zero-shot tasks on unseen sign language datasets from around the world.
+
+
+## Modifications to [SPOTER](https://github.com/matyasbohacek/spoter)
+The main modifications made to the SPOTER code and model architecture are:
+
+* The output layer is still a linear layer, but the model is trained with a triplet loss instead of CrossEntropyLoss. The output of the model
+is therefore an embedding vector that can be used for several downstream tasks.
+* We started with the keypoints dataset published by SPOTER, but later created new datasets using BlazePose from MediaPipe (as done in [Spoter 2](https://arxiv.org/abs/2210.00893)). This improves results considerably.
+* Batches are built so that they contain several hard triplets, and the loss is computed over all hard triplets found in each batch (a minimal sketch of this idea follows this list).
+* Some code refactoring to accommodate the new classes we implemented.
+* A minor code fix when using the rotate augmentation to avoid exceptions.
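+
+As an illustration of in-batch hard-triplet mining, here is a minimal PyTorch sketch of the batch-hard variant. The function name and margin value are illustrative and do not necessarily match this repository's implementation.
+
+```python
+import torch
+
+
+def hard_triplet_loss(embeddings, labels, margin=0.2):
+    """Batch-hard triplet loss: for each anchor, use the hardest
+    positive and hardest negative found inside the batch."""
+    # Pairwise Euclidean distances between all embeddings in the batch.
+    dists = torch.cdist(embeddings, embeddings, p=2)
+
+    # Boolean mask: True where two samples share the same class.
+    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
+
+    # Hardest positive: the farthest embedding sharing the anchor's class.
+    hardest_pos = (dists * same_class.float()).max(dim=1).values
+
+    # Hardest negative: the closest embedding from a different class.
+    hardest_neg = dists.masked_fill(same_class, float("inf")).min(dim=1).values
+
+    return torch.relu(hardest_pos - hardest_neg + margin).mean()
+```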
+
+
+
+
+## Results
+
+
+
+We used the silhouette score to measure how well the embedding clusters are defined during training.
+The silhouette score is high (close to 1) when the clusters of different classes are well separated from each other, and low (close to -1) when they overlap.
+Our best model reached 0.7 on the train set and 0.1 on the validation set.
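+
+For reference, the silhouette score of a set of embeddings and their class labels can be computed with scikit-learn; the arrays below are random placeholders.
+
+```python
+import numpy as np
+from sklearn.metrics import silhouette_score
+
+# embeddings: (num_samples, embedding_dim) vectors produced by the model
+# labels:     (num_samples,) gloss/class ids
+embeddings = np.random.randn(200, 64)        # placeholder data
+labels = np.random.randint(0, 10, size=200)  # placeholder labels
+
+score = silhouette_score(embeddings, labels, metric="euclidean")
+print(f"Silhouette score: {score:.3f}")  # close to 1 = well separated, close to -1 = overlapping
+```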
+
+### Classification accuracy
+While the model was not trained specifically for classification, it can still be used for that purpose.
+Here we report top-1 and top-5 accuracy, computed by taking the 1 (or 5) nearest vectors of different classes to the target vector.
+
+To estimate the accuracy on LSA, we take a “train” set as given and then classify the holdout set based on the closest vectors from the “train” set.
+This is done using a model trained only on the WLASL100 dataset, to show the model's zero-shot capabilities.
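+
+This kind of nearest-neighbour evaluation can be sketched with scikit-learn as follows; the file names and metric are assumptions, not the exact evaluation code used here.
+
+```python
+import numpy as np
+from sklearn.neighbors import KNeighborsClassifier
+
+# Embeddings produced by the WLASL100-trained model for an unseen dataset
+# (hypothetical file names).
+train_embeddings = np.load("lsa_train_embeddings.npy")
+train_labels = np.load("lsa_train_labels.npy")
+holdout_embeddings = np.load("lsa_holdout_embeddings.npy")
+holdout_labels = np.load("lsa_holdout_labels.npy")
+
+# Classify each holdout embedding by its closest "train" embedding (top-1).
+knn = KNeighborsClassifier(n_neighbors=1, metric="euclidean")
+knn.fit(train_embeddings, train_labels)
+top1_accuracy = knn.score(holdout_embeddings, holdout_labels)
+print(f"Top-1 accuracy: {top1_accuracy:.3f}")
+```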
+
+
+
+
+
+
+## Get Started
+
+The recommended way of running code from this repo is by using **Docker**.
+
+Clone this repository and run:
+```
+docker build -t spoter_embeddings .
+docker run --rm -it --entrypoint=bash --gpus=all -v $PWD:/app spoter_embeddings
+```
+
+> Running without specifying the `entrypoint` will train the model with the hyperparameters specified in `train.sh`
+
+If you prefer running in a **virtual environment** instead, then first install dependencies:
+
+```shell
+pip install -r requirements.txt
+```
+
+> We tested this using Python 3.7.13. Other versions may work.
+
+To train the model, run `train.sh` in Docker or your virtual env.
+
+The hyperparameters and their descriptions can be found in `train.py`.
+
+
+## Data
+
+As with SPOTER, this model works on sequences of signers' skeletal data extracted from videos.
+The input therefore has a much lower dimension than raw video, so the model is quicker and lighter,
+and you can choose any SOTA body pose model to preprocess the videos.
+This makes the model lightweight enough to run in real time (for example, it takes around 40 ms to process a 4-second,
+25 FPS video inside a web browser using onnxruntime).
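+
+For instance, a model exported to ONNX can be run with onnxruntime along these lines; the model path, input name handling, and tensor shape are assumptions that depend on how the model is exported.
+
+```python
+import numpy as np
+import onnxruntime as ort
+
+# Load an exported model (hypothetical path).
+session = ort.InferenceSession("spoter_embeddings.onnx")
+input_name = session.get_inputs()[0].name
+
+# A 4-second clip at 25 FPS: 100 frames of flattened 2D keypoints.
+# The number of values per frame (here 108) depends on the pose extractor used.
+pose_sequence = np.random.rand(1, 100, 108).astype(np.float32)
+
+embedding = session.run(None, {input_name: pose_sequence})[0]
+print(embedding.shape)  # e.g. (1, embedding_dim)
+```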
+
+
+
+For ready-to-use datasets, refer to the [SPOTER](https://github.com/matyasbohacek/spoter) repository.
+
+For best results, we recommend building your own dataset: download a sign language video dataset such as [WLASL], then use the `extract_mediapipe_landmarks.py` and `create_wlasl_landmarks_dataset.py` scripts to create a body keypoint dataset that can be used to train the SPOTER embeddings model.
+
+You can run these scripts as follows:
+```bash
+# This will extract landmarks from the downloaded videos
+python3 preprocessing.py extract -videos ...
+```