smile-nlp-kt
Kotlin extensions / interfaces that extend the Java/Scala implementation/implicits of Smile NLP. Basically a simplification for Kotlin (& probably Java) users.
Installation
Jitpack (the easiest)
Add the following to your `build.gradle`. `$version` should be equal to the version supplied by the tag above.
```
repositories {
    maven { url "https://jitpack.io" }
}

dependencies {
    implementation 'com.londogard:smile-nlp-kt:$version'
}
```
GitHub Packages
Add the following to your `build.gradle`. `$version` should be equal to the version supplied by the tag above.
The credentials part is how I understand that you need to log in to the GitHub Packages repository. If you know a better way, please ping me in an issue.
```
repositories {
    maven {
        url = uri("https://maven.pkg.github.com/londogard/smile-nlp-kt")
        credentials {
            username = project.findProperty("gpr.user") ?: System.getenv("GH_USERNAME")
            password = project.findProperty("gpr.key") ?: System.getenv("GH_TOKEN")
        }
    }
}

dependencies {
    implementation "com.londogard:smile-nlp-kt:$version"
}
```
Installing Smile
Smile-NLP is required; you can find the artifact here (installable via Gradle). As you can see in the Gradle file, this is currently used in conjunction with Smile 2.2.2, but it should work with newer versions too.
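For reference, a dependency line for Smile NLP might look like the following (using Smile's smile-nlp artifact under the com.github.haifengl group; adjust the version as needed):

```
dependencies {
    // Smile NLP itself; 2.2.2 matches the version mentioned above, newer versions may also work.
    implementation("com.github.haifengl:smile-nlp:2.2.2")
}
```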
Usage
I'll go through the usage of the components, copying the structure from the homepage of Smile. Please make sure to read the official documentation for more context; I've tried to extract a short piece of text for each chapter but have cut out a lot of text.
Normalization
The function normalize is a simple normalizer for processing Unicode text (see the sketch after this list):
- Apply Unicode normalization form NFKC.
- Strip, trim, normalize, and compress whitespace.
- Remove control and formatting characters.
- Normalize dash, double and single quotes.
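A sketch of how this might look from Kotlin, assuming the normalize() extension mirrors Smile's Scala operator and is brought into scope by implementing SmileOperators (see the Bag of Words section below):

```kotlin
import com.londogard.smile.SmileOperators

// Implementing SmileOperators brings the String extensions into scope.
object NormalizeExample : SmileOperators {
    fun demo() {
        val raw = "Here\u00A0is  a  “messy”  –  string  with\todd whitespace."
        // Assumed extension: NFKC normalization, whitespace compression,
        // control-character removal, dash/quote normalization.
        println(raw.normalize())
    }
}
```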
Sentence Breaking
Smile implements an efficient rule-based sentence splitter for English.
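A minimal sketch, assuming a sentences() extension mirroring the Scala operator (and the same SmileOperators setup as above):

```kotlin
// Inside a class/object that implements com.londogard.smile.SmileOperators:
fun sentenceDemo() {
    val sentences = "The quick brown fox jumped. Did it land on the lazy dog?".sentences() // assumed extension
    sentences.forEach(::println)
}
```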
Word Segmentation
The method words(filter) assumes that an English text has already been segmented into sentences and splits a sentence into tokens. This method takes a filter parameter that removes stop words. Please read the official docs for more info.
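A sketch under the same assumptions; in Smile's Scala API the filter defaults to an English stop-word list and "none" disables filtering, so the Kotlin extension presumably behaves similarly:

```kotlin
// Inside a class/object that implements com.londogard.smile.SmileOperators:
fun tokenDemo() {
    val sentence = "There is a wonderful world out there for tokenizers."
    val allTokens = sentence.words("none") // "none" assumed to keep stop words
    val filtered = sentence.words()        // default assumed to drop English stop words
    println(allTokens.joinToString(" "))
    println(filtered.joinToString(" "))
}
```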
Stemming
Stemming is a crude heuristic process that chops off the ends of words in the hope of reducing them to their common base form correctly most of the time, and often includes the removal of derivational affixes.
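Smile's stemmers can be used directly from Kotlin without any wrapper; a small sketch using the Porter and Lancaster implementations from smile.nlp.stemmer:

```kotlin
import smile.nlp.stemmer.LancasterStemmer
import smile.nlp.stemmer.PorterStemmer

fun main() {
    val porter = PorterStemmer()
    val lancaster = LancasterStemmer()
    println(porter.stem("obligations"))    // Porter-stemmed form
    println(lancaster.stem("obligations")) // Lancaster is typically more aggressive
}
```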
Bag of Words
The bag-of-words model is a simple representation of text as the bag of its words, disregarding grammar and word order but keeping multiplicity.
The method bag(stemmer) returns a map of word to frequency. By default, the stemmer parameter uses Porter's algorithm; pass None to disable stemming. There is a similar function bag2(stemmer) that returns a binary bag of words (Set[String]). That is, presence/absence is used instead of frequencies.
The function vectorize(features, bag) converts a bag of words to a feature vector. The parameter features is the token list used as features in machine learning models. Generally, it is not good practice to use all tokens in the corpus as features.
To use these functions we need to extend the interface SmileOperators, found in com.londogard.smile.
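Putting it together, a sketch of what this might look like; the bag/bag2/vectorize names come from the text above, but their exact Kotlin signatures and the feature list below are assumptions:

```kotlin
import com.londogard.smile.SmileOperators

object BagOfWordsExample : SmileOperators {
    fun demo() {
        val text = "The cat sat on the mat. The cat likes the mat."

        val frequencies = text.bag()   // assumed: (stemmed) word -> frequency
        val presence = text.bag2()     // assumed: set of distinct (stemmed) words

        // Hand-picked feature tokens for illustration only.
        val features = listOf("cat", "mat", "dog")
        val vector = vectorize(features, presence) // assumed member of SmileOperators
        println(frequencies)
        println(vector)
    }
}
```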
Phrase / Collocation Extraction
There are also functions for collocation extraction: bigram finds bigrams, and when we want longer phrases we can use the ngram function instead, as sketched below.
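A rough sketch; the parameter order (top-k, minimum frequency, text) is an assumption borrowed from Smile's Scala API:

```kotlin
// Inside a class/object that implements com.londogard.smile.SmileOperators:
fun collocationDemo(corpus: String) {
    val topBigrams = bigram(10, 5, corpus)  // assumed: 10 bigrams occurring at least 5 times
    val phrases = ngram(4, 4, corpus)       // assumed: up-to-4-word phrases occurring at least 4 times
    println(topBigrams)
    println(phrases)
}
```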
Keyword Extraction
Beyond finding phrases, keyword extraction is tasked with the automatic identification of terms that best describe the subject of a document. Keywords are the terms that represent the most relevant information contained in the document, i.e. a characterization of the topic discussed in it.
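Assuming a keywords(k) extension analogous to Smile's Scala operator, usage might look like:

```kotlin
// Inside a class/object that implements com.londogard.smile.SmileOperators:
fun keywordDemo() {
    val text = "Kotlin is a modern language that compiles to the JVM, JavaScript and native binaries."
    val keywords = text.keywords(5) // assumed extension: the 5 top-ranked keywords
    println(keywords)
}
```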
Part-Of-Speech Tagging
A part of speech (PoS) is a category of words which have similar grammatical properties. Words that are assigned to the same part of speech generally display similar behavior in terms of syntax – they play similar roles within the grammatical structure of sentences – and sometimes in terms of morphology, in that they undergo inflection for similar properties.
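Assuming a postag extension mirroring the Scala operator, a sketch:

```kotlin
// Inside a class/object that implements com.londogard.smile.SmileOperators:
fun posDemo() {
    val tags = "The quick brown fox jumped over the lazy dog.".postag() // assumed extension: one PoS tag per token
    println(tags)
}
```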
The rest of the methods supplied by Smile should be easy to use from Kotlin, so they're not wrapped.
Please read more in the official documentation.