londogard-nlp-toolkit

Londogard Natural Language Processing Toolkit written in Kotlin for the JVM.
This toolkit will be used throughout Londogard libraries/products such as our Summarizer, Text-Generation & more.

The `LanguageSupport` enum indicates which languages each tool, such as Embeddings or Stopwords, supports out-of-the-box.

| Tool | Info | Docs | Samples (Kotlin Notebook) |
| --- | --- | --- | --- |
| Word Embeddings | Word & Subword Embeddings available in 157 languages (fastText.cc) & 275 languages (bpemb) out-of-the-box. | embeddings | wordembeddings.ipynb |
| Sentence Embeddings | Average & Unsupervised Random Walk Sentence Embeddings | sentence-embeddings | sentence-embeddings.ipynb |
| Stopwords | Supports 23 languages out-of-the-box through NLTK's stopword lists | stopwords | stopwords.ipynb |
| Word Frequencies | Supports 34 languages out-of-the-box through LuminosoInsight word frequency tables | wordfrequency | wordfreq.ipynb |
| Stemming | Supports 14 languages out-of-the-box using Snowball Stemmer under the hood | stemming | stemmer.ipynb |
| Tokenizers | Char, Word, Subword & Sentence Tokenizer support! SentencePiece? HuggingFace? It's there! | tokenizers, sentence-tokenizers | tokenizer.ipynb |
| Vectorizers & Encoders | BagOfWords, TF-IDF, BM25 & OneHot | vectorizers (TF-IDF, BM-25, ..), count-vectorizers (Count, Hash, ..), encoders (OneHot), transforms (TF-IDF, BM-25, ..) | TODO |
| Keyword Extractions | CooccurenceKeywords based on algorithm proposed in DOI:10.1142/S0218213004001466 | | keywords.ipynb |
| Machine Learning | LogisticRegression Classifier (using Gradient Descent), NaïveBayes (binary) & Hidden Markov Model (HMM) as Sequence Classifier | classifiers (LogisticRegression, NaïveBayes), regression (LinearRegression), sequence classifier (HiddenMarkovModel) | See e2e-examples |
| Deep Learning (Transformers / HuggingFace) | `ClassifierPipeline` and `TokenClassifierPipeline`, which support HuggingFace ONNX model-names & PyTorch from local files | transformers | See e2e-examples |
| spaCy-like API | 🚧WIP🚧 | | |
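
To give a feel for what "Average" sentence embeddings means in the table above, here is a minimal, self-contained sketch in plain Kotlin. It does not use the toolkit's API: `averageEmbedding` and `toyVectors` are hypothetical names invented for illustration, and the two-dimensional vectors are toy data rather than real fastText or bpemb embeddings.

```kotlin
// Illustrative only: NOT the londogard-nlp API. All names here are hypothetical.
// Averages the word vectors of the known tokens in a sentence.
fun averageEmbedding(
    tokens: List<String>,
    vectors: Map<String, FloatArray>,
    dim: Int,
): FloatArray {
    val sum = FloatArray(dim)
    var known = 0
    for (token in tokens) {
        val vec = vectors[token] ?: continue // skip out-of-vocabulary tokens
        for (i in 0 until dim) sum[i] += vec[i]
        known++
    }
    // Divide by the number of tokens that had a vector; all-zero if none did.
    if (known > 0) for (i in 0 until dim) sum[i] /= known
    return sum
}

fun main() {
    // Toy 2-dimensional "embeddings" (assumed data, for demonstration only).
    val toyVectors = mapOf(
        "hello" to floatArrayOf(1f, 0f),
        "world" to floatArrayOf(0f, 1f),
    )
    val sentence = "hello big world".split(" ")
    println(averageEmbedding(sentence, toyVectors, dim = 2).toList()) // [0.5, 0.5]
}
```

Out-of-vocabulary tokens ("big" here) are simply skipped in this sketch; a real implementation may handle them differently, for example through subword embeddings.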

Installation

MavenCentral

implementation("com.londogard:nlp:$version")
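
In a Gradle Kotlin DSL build, the dependency line above might sit in a `build.gradle.kts` roughly as sketched here. This assumes the Kotlin JVM (or Java) plugin is already applied so that the `implementation` configuration exists; `nlpVersion` is a hypothetical variable name and its value is a placeholder to replace with a released version from MavenCentral.

```kotlin
// build.gradle.kts (sketch; assumes the Kotlin JVM or Java plugin is applied elsewhere)
repositories {
    mavenCentral() // the artifact is published to MavenCentral
}

dependencies {
    // Placeholder: substitute an actual released version.
    val nlpVersion = "<version>"
    implementation("com.londogard:nlp:$nlpVersion")
}
```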

Guides

Simple end-to-end guides are available as notebooks via docs/samples.

This includes:

1. IMDB Sentiment Analysis using Logistic Regression or Naïve Bayes
2. IMDB Sentiment Analysis using HuggingFace Transformers, using `ClassifierPipeline.create(<model-name>)`
3. POS-Tagging using Hidden Markov Model
4. POS-Tagging using HuggingFace Transformers, using `TokenClassifierPipeline.create(<model-name>)`

& potentially more.