voc.txt is a merge of data from:

* The original Snowball finnish/voc.txt, which is licensed under the BSD
  3-clause license (as described in ../COPYING).  This list contains exactly
  50000 words, all of which start with the letters a to i (inclusive).

* A list of 42467 words extracted from a downloaded dump of the Finnish
  Wikipedia like so:

    scripts/wikipedia-dump-to-freq fiwiki-20260402-pages-articles.xml.bz2 150 latin1 | grep -v "''" | grep -v "^'" | grep -v "'$" > voc1.txt

* Some hand-picked examples (voc2.txt in the merge command below) to provide
  better coverage for particular stemmer rules and/or changes:

    bodø
    bodøhön
    bodøn
    bodøssa
    bodøssä
    bodøstä
    jylhien
    jylhiin
    jylhiä
    jylhän
    jylhät
    jylhää
    jylhään
    pölhö
    pölhöihin
    pölhöjen
    pölhöjä
    pölhön
    pölhöt
    pölhöä
    pölhöön
    strynøhön
    ylhien
    ylhiin
    ylhiä
    ylhä
    ylhän
    ylhät
    ylhää
    ylhään
    bordeaux'hon
    bordeaux'iin
    bordeaux'n
    bordeaux'ssa
    bordeaux'sta
    bordeaux'ta
    calais'hen
    chamonix'n
    chamonix'ssa
    d'huez'n
    dna'han
    glasgow'hun
    glasgow'sta
    prix'hin
    prix'iin
    prix'lla
    prix'n
    prix'ssa
    prix'ssä
    prix'stä
    prix'tä
    pyrénées'n
    rei'illä
    rei'istä
    renault'lle
    renault'lta
    renault'n
    shafi'in
    show'hun
    show'n
    show'ssa
    show'ta
    versailles'n
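
The three grep filters in the Wikipedia extraction command above drop any word
that contains two consecutive apostrophes, starts with an apostrophe, or ends
with one.  A minimal sketch of their effect, using a few made-up input words
rather than real output from scripts/wikipedia-dump-to-freq (the function name
filter_apostrophes is hypothetical):

```shell
# Same three filters as in the extraction command, wrapped in a function.
filter_apostrophes() {
    grep -v "''" | grep -v "^'" | grep -v "'$"
}

# Made-up sample words; only "show'n" and "talo" pass all three filters.
printf '%s\n' "show'n" "'quoted" "trailing'" "it''s" "talo" | filter_apostrophes
```

Words with an apostrophe in the middle, such as the hand-picked examples above,
are kept.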

The word lists were merged like so:

  LANG=C sort -u voc.txt voc1.txt voc2.txt > finnish/voc.txt
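
In the C locale, sort compares raw bytes instead of applying locale collation
rules, and -u keeps a single copy of each duplicated line.  A small sketch with
made-up file contents (the temporary files are hypothetical stand-ins for the
real lists).  It sets LC_ALL=C rather than LANG=C, because an LC_ALL value
already present in the environment would take precedence over LANG:

```shell
tmp=$(mktemp -d)
printf '%s\n' "ylhä" "bodø" "talo" > "$tmp/a.txt"
printf '%s\n' "talo" "prix'n" > "$tmp/b.txt"

# Byte-wise sort with duplicates removed; "talo" appears only once.
merged=$(LC_ALL=C sort -u "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$merged"

rm -r "$tmp"
```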

output.txt was generated from voc.txt by running it through the stemmer:

  stemwords -l finnish -c UTF_8 -i finnish/voc.txt -o finnish/output.txt
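
Assuming stemwords writes one stem per input word, output.txt should have the
same number of lines as voc.txt.  A quick sanity check along those lines (the
helper name same_line_count is hypothetical):

```shell
# Succeed only if both files have the same number of lines.
# The command substitutions are left unquoted so that any padding
# that wc adds around the count is stripped by word splitting.
same_line_count() {
    [ $(wc -l < "$1") -eq $(wc -l < "$2") ]
}

# Example usage, with the paths from the commands above:
#   same_line_count finnish/voc.txt finnish/output.txt && echo "line counts match"
```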

Wikipedia content is licensed under CC BY-SA 3.0:
https://creativecommons.org/licenses/by-sa/3.0/
