londogard-nlp-toolkit
Londogard Natural Language Processing Toolkit written in Kotlin for the JVM.
This toolkit will be used throughout Londogard libraries/products such as our Summarizer, Text-Generation & more.
The LanguageSupport enum is used to determine what support different tools like Embeddings or Stopwords have out-of-the-box.
Tool | Info | Docs | Samples (Kotlin Notebook) |
---|---|---|---|
Word Embeddings | Word & Subword Embeddings available in 157 (fastText.cc) & 275 (bpemb) languages out-of-the-box. | embeddings | wordembeddings.ipynb |
Sentence Embeddings | Average & Unsupervised Random Walk Sentence Embeddings | sentence-embeddings | sentence-embeddings.ipynb |
Stopwords | Supports 23 languages out-of-the-box through NLTK's list of stopwords | stopwords | stopwords.ipynb |
Word Frequencies | Supports 34 languages out-of-the-box through LuminosoInsight word frequency tables | wordfrequency | wordfreq.ipynb |
Stemming | Supports 14 languages out-of-the-box using Snowball Stemmer under the hood | stemming | stemmer.ipynb |
Tokenizers | Char, Word, Subword & Sentence Tokenizer support! SentencePiece? HuggingFace? It's there! | tokenizers, sentence-tokenizers | tokenizer.ipynb |
Vectorizers & Encoders | BagOfWords, TF-IDF, BM25 & OneHot | vectorizers (TF-IDF, BM-25, ..), count-vectorizers (Count, Hash, ..), encoders (OneHot), transforms (TF-IDF, BM-25, ..) | TODO |
Keyword Extractions | CooccurenceKeywords based on the algorithm proposed in DOI:10.1142/S0218213004001466 | keywords.ipynb | |
Machine Learning | LogisticRegression Classifier (using Gradient Descent), NaïveBayes (binary) & Hidden Markov Model (HMM) as Sequence Classifier | classifiers (LogisticRegression, NaïveBayes), regression (LinearRegression), sequence classifier (HiddenMarkovModel) | See e2e-examples |
Deep Learning (Transformers / HuggingFace) | ClassifierPipeline and TokenClassifierPipeline, which support HuggingFace ONNX model names & PyTorch models from local files | transformers | See e2e-examples |
spaCy-like API | 🚧WIP🚧 | | |
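
To give a feel for what the TF-IDF entry in the table computes, here is a minimal standalone sketch in plain Kotlin. It does not use the toolkit's own vectorizer API: `tokenize` and `tfIdf` are hypothetical helper names, and the smoothed IDF formula is just one common variant, so the toolkit's actual weights may differ.

```kotlin
import kotlin.math.ln

// Standalone illustration; none of these names come from londogard-nlp-toolkit.
fun tokenize(text: String): List<String> =
    text.lowercase().split(Regex("\\W+")).filter { it.isNotEmpty() }

// Returns one sparse vector (term -> weight) per document.
// weight = term frequency * smoothed inverse document frequency
fun tfIdf(docs: List<String>): List<Map<String, Double>> {
    val tokenized = docs.map(::tokenize)
    // Document frequency: in how many documents does each term occur?
    val df = tokenized.flatMap { it.toSet() }.groupingBy { it }.eachCount()
    val n = docs.size.toDouble()
    return tokenized.map { tokens ->
        tokens.groupingBy { it }.eachCount().mapValues { (term, count) ->
            val tf = count.toDouble() / tokens.size
            val idf = ln((1 + n) / (1 + df.getValue(term))) + 1 // smoothed IDF
            tf * idf
        }
    }
}
```

Calling `tfIdf(listOf("the cat sat", "the dog sat down", "a bird sang"))` gives "cat" a higher weight than "the" in the first document, because "the" also occurs in the second one.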
Installation
MavenCentral
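
A minimal Gradle (Kotlin DSL) sketch for pulling the toolkit from Maven Central. The coordinates `com.londogard:nlp` are an assumption here and `<version>` is a placeholder, so check Maven Central for the published artifact and the latest release before copying this.

```kotlin
// build.gradle.kts
repositories {
    mavenCentral()
}

dependencies {
    // Assumed coordinates; replace <version> with the latest published release.
    implementation("com.londogard:nlp:<version>")
}
```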
Guides
Simple end-to-end guides are available as notebooks in docs/samples.
This includes:
1. IMDB Sentiment Analysis using Logistic Regression or Naïve Bayes
2. IMDB Sentiment Analysis using HuggingFace Transformers, using ClassifierPipeline.create(<model-name>)
3. POS-Tagging using Hidden Markov Model
4. POS-Tagging using HuggingFace Transformers, using TokenClassifierPipeline.create(<model-name>)
& potentially more.
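
For readers new to HMM-based POS-tagging, the decoding step behind guide 3 can be sketched as a toy Viterbi decoder in plain Kotlin. This is not the toolkit's HiddenMarkovModel implementation: the function name, its parameters, and the `unseen` fallback probability are all assumptions made for illustration.

```kotlin
import kotlin.math.ln

// Toy Viterbi decoder; hypothetical names, not the toolkit's API.
fun viterbi(
    observations: List<String>,
    states: List<String>,
    start: Map<String, Double>,
    transition: Map<String, Map<String, Double>>,
    emission: Map<String, Map<String, Double>>,
): List<String> {
    if (observations.isEmpty()) return emptyList()
    val unseen = 1e-6 // fallback probability for unseen words or transitions
    fun emit(state: String, word: String) = emission[state]?.get(word) ?: unseen
    fun trans(from: String, to: String) = transition[from]?.get(to) ?: unseen

    // best[t][s] = log-probability of the best path ending in state s at step t
    val best = Array(observations.size) { DoubleArray(states.size) }
    // back[t][s] = index of the previous state on that best path
    val back = Array(observations.size) { IntArray(states.size) }

    for (s in states.indices) {
        best[0][s] = ln(start[states[s]] ?: unseen) + ln(emit(states[s], observations[0]))
    }
    for (t in 1 until observations.size) {
        for (s in states.indices) {
            var bestPrev = 0
            var bestScore = Double.NEGATIVE_INFINITY
            for (p in states.indices) {
                val score = best[t - 1][p] + ln(trans(states[p], states[s]))
                if (score > bestScore) {
                    bestScore = score
                    bestPrev = p
                }
            }
            best[t][s] = bestScore + ln(emit(states[s], observations[t]))
            back[t][s] = bestPrev
        }
    }

    // Start from the best final state and follow the back-pointers.
    var current = states.indices.maxByOrNull { best[observations.size - 1][it] }
        ?: return emptyList()
    val path = ArrayDeque<String>()
    for (t in observations.size - 1 downTo 0) {
        path.addFirst(states[current])
        current = back[t][current]
    }
    return path.toList()
}
```

Working in log-space keeps long sentences from underflowing to zero, which is why the probabilities are summed as logarithms rather than multiplied.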