summarize-kt
Summarization library with an easy-to-use API (pre-loaded models). Currently only extractive summarization is supported.
You can try it out on londogard.com.
Usage
There's an interface named Summarizer that allows us to select the method of summarization through its companion object. Two variants are available:
1) Summarizer.tfIdfSummarizer
2) Summarizer.embeddingClusterSummarizer(threshold: Double = 0.2, simThreshold: Double = 0.95, scoreConfig: ScoringConfig = ScoringConfig.Ghalandari)
embeddingClusterSummarizer supports two different scoring configurations. Read more under "Explanation of the different configs".
Summarizer has two important methods:
Example where we'd return ~30% of the content
Explanation of the different configs
Summarizer currently supports two different versions, either TfIdf or EmbeddingCluster, where the latter has two different configs.
Term Frequency-Inverse Document Frequency (TFIDF)
TfIdf uses TF-IDF scores to find the most important sentences and then returns those as the summary.
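To make the idea concrete, here is a minimal, self-contained sketch of TF-IDF sentence ranking. It is not the library's implementation: the function names (`splitSentences`, `tokenize`, `tfIdfSentenceScores`, `summarizeByTfIdf`), the naive tokenizer, and the choice to treat each sentence as its own "document" when computing IDF are all assumptions made for illustration.

```kotlin
import kotlin.math.ln

// Naive sentence splitter (assumption): breaks after '.', '!' or '?'.
fun splitSentences(text: String): List<String> =
    text.split(Regex("(?<=[.!?])\\s+")).map { it.trim() }.filter { it.isNotEmpty() }

// Naive tokenizer (assumption): lowercased runs of letters and digits.
fun tokenize(sentence: String): List<String> =
    Regex("[\\p{L}\\p{N}]+").findAll(sentence.lowercase()).map { it.value }.toList()

// Scores each sentence as the sum of its words' TF-IDF weights, where
// TF is normalised by sentence length and IDF is computed over sentences.
fun tfIdfSentenceScores(sentences: List<String>): List<Double> {
    val tokenized = sentences.map(::tokenize)
    val n = tokenized.size.toDouble()
    val docFreq = tokenized.flatMap { it.toSet() }.groupingBy { it }.eachCount()
    return tokenized.map { words ->
        if (words.isEmpty()) return@map 0.0
        val termCounts = words.groupingBy { it }.eachCount()
        termCounts.entries.sumOf { (word, count) ->
            (count.toDouble() / words.size) * ln(n / docFreq.getValue(word))
        }
    }
}

// Keeps roughly `ratio` of the sentences (at least one), highest score first,
// then restores the original sentence order.
fun summarizeByTfIdf(text: String, ratio: Double): String {
    val sentences = splitSentences(text)
    val keep = (sentences.size * ratio).toInt().coerceAtLeast(1)
    val scores = tfIdfSentenceScores(sentences)
    return sentences.indices
        .sortedByDescending { scores[it] }
        .take(keep)
        .sorted()
        .joinToString(" ") { sentences[it] }
}

fun main() {
    val text = "The cat sat. The dog sat. The zebra jumped high."
    println(summarizeByTfIdf(text, ratio = 0.34))
}
```

Words that occur in every sentence get an IDF of zero here, so they contribute nothing to a sentence's score.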
Embedding Cluster
EmbeddingCluster combines TfIdf and word embeddings.
In essence, a centroid of the full document is created, where only words above a certain TfIdf score are allowed to contribute.
The centroid is built from word embeddings: we pick the words above the threshold, aggregate their embedding vectors, and then normalize the result. This is the centroid.
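The centroid construction can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function name `documentCentroid`, the map-based inputs, summation as the aggregation step, and L2 normalization are all choices made for this example.

```kotlin
import kotlin.math.sqrt

// Sums the embeddings of every word whose TF-IDF score is above `threshold`,
// then L2-normalises the sum. Words without an embedding are skipped.
fun documentCentroid(
    tfIdf: Map<String, Double>,
    embeddings: Map<String, DoubleArray>,
    threshold: Double,
    dim: Int
): DoubleArray {
    val sum = DoubleArray(dim)
    for ((word, score) in tfIdf) {
        if (score <= threshold) continue
        val vector = embeddings[word] ?: continue
        for (i in 0 until dim) sum[i] += vector[i]
    }
    val norm = sqrt(sum.sumOf { it * it })
    // An all-zero sum cannot be normalised, so it is returned unchanged.
    return if (norm == 0.0) sum else DoubleArray(dim) { sum[it] / norm }
}

fun main() {
    // Toy two-dimensional embeddings, made up for the example.
    val embeddings = mapOf(
        "cat" to doubleArrayOf(1.0, 0.0),
        "dog" to doubleArrayOf(0.0, 1.0),
        "the" to doubleArrayOf(5.0, 5.0)
    )
    val tfIdf = mapOf("cat" to 0.5, "dog" to 0.4, "the" to 0.05)
    println(documentCentroid(tfIdf, embeddings, threshold = 0.2, dim = 2).toList())
}
```

In the toy data, "the" falls below the threshold and is left out, so only "cat" and "dog" shape the centroid.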
When this is done we either:
- Find all the sentences that are closest to this centroid (not including sentences that are too similar to an already included sentence, using the similarityThreshold), or
- Do the same as above, but instead of comparing the sentence to the centroid we compare the centroid of the current summary (with the new sentence added) to the document centroid. That is, we now compare the new summary as a whole with the document, so that the sentences play well together.
The approach is chosen by the ScoringConfig, where the first approach is based on Rossiello's work and the second is based on Ghalandari's.
In addition, one can set the TfIdf threshold mentioned above using threshold, and the similarity threshold using similarityThreshold.
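The two selection strategies described above can be sketched side by side. This is a simplified illustration, not the library's code: the function names `selectBySentence` and `selectBySummary`, the use of cosine similarity, representing a summary by the sum of its sentence vectors, and the `limit` parameter are all assumptions made for this example.

```kotlin
import kotlin.math.sqrt

fun cosine(a: DoubleArray, b: DoubleArray): Double {
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return if (normA == 0.0 || normB == 0.0) 0.0 else dot / (sqrt(normA) * sqrt(normB))
}

fun add(a: DoubleArray, b: DoubleArray) = DoubleArray(a.size) { a[it] + b[it] }

// First approach: rank sentences by their own similarity to the centroid,
// skipping any sentence too similar to one that is already chosen.
fun selectBySentence(
    sentenceVecs: List<DoubleArray>, centroid: DoubleArray, limit: Int, simThreshold: Double
): List<Int> {
    val chosen = mutableListOf<Int>()
    val ranked = sentenceVecs.indices.sortedByDescending { cosine(sentenceVecs[it], centroid) }
    for (candidate in ranked) {
        if (chosen.size == limit) break
        val redundant = chosen.any { cosine(sentenceVecs[it], sentenceVecs[candidate]) > simThreshold }
        if (!redundant) chosen += candidate
    }
    return chosen.sorted()
}

// Second approach: greedily add the sentence that brings the summary as a
// whole (previous picks plus the candidate) closest to the centroid.
fun selectBySummary(
    sentenceVecs: List<DoubleArray>, centroid: DoubleArray, limit: Int, simThreshold: Double
): List<Int> {
    val chosen = mutableListOf<Int>()
    var summaryVec = DoubleArray(centroid.size)
    while (chosen.size < limit) {
        val best = sentenceVecs.indices
            .filter { it !in chosen }
            .filter { c -> chosen.none { cosine(sentenceVecs[it], sentenceVecs[c]) > simThreshold } }
            .maxByOrNull { cosine(add(summaryVec, sentenceVecs[it]), centroid) } ?: break
        chosen += best
        summaryVec = add(summaryVec, sentenceVecs[best])
    }
    return chosen.sorted()
}

fun main() {
    // Toy sentence vectors, made up for the example.
    val vecs = listOf(doubleArrayOf(1.0, 0.6), doubleArrayOf(1.0, 0.2), doubleArrayOf(0.0, 1.0))
    val centroid = doubleArrayOf(1.0, 1.0)
    println(selectBySentence(vecs, centroid, limit = 2, simThreshold = 0.95))
    println(selectBySummary(vecs, centroid, limit = 2, simThreshold = 0.95))
}
```

On the toy vectors the two strategies pick different second sentences: the first takes the next-closest sentence on its own, while the second prefers the sentence that balances the summary towards the centroid.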
Note: if you want to use custom embeddings you'll currently have to fork the project. The embeddings are downloaded automatically if you don't have them (note: this is a ~1 GB download, which then takes up 157 MB on disk).
Installation
The code is published to two different repositories: Jitpack.io and GitHub Packages.
Jitpack (easiest)
Add the following to your build.gradle. $version should be equal to the version supplied by the tag above.
GitHub Packages
Add the following to your build.gradle. $version should be equal to the version supplied by the tag above.
The step for logging in to the GitHub repository reflects my understanding of how authentication has to be done. If you know a better way, please ping me in an issue.