TF*IDF Ranker lets the user score each term of an input text according to the classic TF*IDF formula or a modified version of it.
We offer two versions of this software: an English-only one, and TF*IDF Ranker_2L, which also supports Russian.
This is the classic formula, widely used in term weighting techniques:
w(ij) = tf(ij) * log(N / n)
where
w(ij) = weight of term Tj in document Di
tf(ij) = frequency of term Tj in document Di
N = number of documents in the corpus
n = number of documents where term Tj occurs at least once
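The classic weighting can be sketched in a few lines of Python. This is only an illustration, not the code TF*IDF Ranker runs: the function name `classic_tfidf`, the whitespace tokenization, and the natural logarithm are assumptions (the log base is not stated here). Terms that do not occur in the corpus are given a weight of zero.

```python
import math
from collections import Counter

def classic_tfidf(text, corpus):
    """Return {term: weight} for `text`, weighted against `corpus` (a list of strings)."""
    tf = Counter(text.lower().split())                  # tf(ij): raw frequency in the input text
    doc_terms = [set(doc.lower().split()) for doc in corpus]
    N = len(corpus)                                     # N: number of documents in the corpus
    weights = {}
    for term, freq in tf.items():
        n = sum(term in terms for terms in doc_terms)   # n: documents containing the term
        # Assumed behaviour: a term absent from the corpus (n == 0) scores zero.
        weights[term] = freq * math.log(N / n) if n else 0.0
    return weights

# Toy corpus and input text, invented for this example.
corpus = ["the cat sat", "the dog ran", "the bird flew"]
weights = classic_tfidf("the cat saw the zorp", corpus)
for term, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}\t{w:.3f}")
```

In this toy run "the" occurs in every document, so log(N / n) is zero and the word drops to the bottom of the list, while "cat" occurs in only one document and ranks first.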
Once you have the list of terms with their weights in descending order, you can copy it to an external editor and use it in various ways. For example, you can filter out stop-words (words with zero or low scores) or, conversely, take the most salient words, those with the highest weights, to represent the content of the input text.
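The two uses described above, dropping low-scoring words and keeping the most salient ones, might look like this in Python. The weights are made-up values for illustration, and the cut-off `threshold` and the count `top_k` are assumed parameters, not settings of TF*IDF Ranker.

```python
def split_by_weight(weights, threshold=0.0, top_k=3):
    """Split {term: weight} into likely stop-words and the top_k most salient terms."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    low = [term for term, w in ranked if w <= threshold]          # stop-word candidates
    salient = [term for term, w in ranked[:top_k] if w > threshold]
    return low, salient

# Illustrative weights only; real values come from the ranked list.
weights = {"ranker": 2.2, "corpus": 1.1, "weight": 0.7, "the": 0.0, "of": 0.0}
low, salient = split_by_weight(weights)
print("low-scoring:", low)
print("salient:", salient)
```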
A drawback of this formula is that terms that occur in the input text but not in the corpus get zero scores. In many cases such terms are important for understanding the text. For example, a scientist may describe an invention using newly coined terms, or a writer may invent neologisms that are not registered in existing corpora. That is why we modified the classic formula: if a word occurs in the input text but does not occur in the corpus, n in the formula is assigned the value 1 rather than 0.
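A minimal sketch of the modified formula follows. As before, the function name `modified_tfidf`, the whitespace tokenization, and the natural logarithm are assumptions made for illustration. The only change from the classic version is that n is raised to 1 when a term is absent from the corpus.

```python
import math
from collections import Counter

def modified_tfidf(text, corpus):
    """Return {term: weight}; terms unseen in `corpus` are treated as if n = 1."""
    tf = Counter(text.lower().split())
    doc_terms = [set(doc.lower().split()) for doc in corpus]
    N = len(corpus)
    weights = {}
    for term, freq in tf.items():
        n = sum(term in terms for terms in doc_terms)
        n = max(n, 1)                        # modified formula: n = 1 instead of 0
        weights[term] = freq * math.log(N / n)
    return weights

# Toy corpus and input text, invented for this example.
corpus = ["the cat sat", "the dog ran", "the bird flew"]
weights = modified_tfidf("the cat saw the zorp", corpus)
for term, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}\t{w:.3f}")
```

Here the invented word "zorp" receives the same weight as a word found in exactly one corpus document, so a newly coined term is ranked among the salient words instead of at the bottom.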
The main problem with the TF*IDF technique is the number of texts in the corpus, i.e. the value of N. How many texts must the corpus include to be representative enough? No formal criteria have been developed so far; we are working on this problem and hope to suggest a solution in the near future.
TF*IDF Ranker runs on Windows machines and requires the .NET Framework. It supports English texts in .txt format.
How to use
1) Use the add button to add texts and create a corpus.
2) Upload a text to analyze.
3) Select the classic formula or the modified formula. The classic version is the default option.
4) Click analyze.
5) Get a list of terms arranged in descending order of their weights.
6) Copy the list to an external editor for further processing.
TF*IDF Ranker filters stopwords effectively, though much depends on the corpus against which the input text is matched. Sometimes additional filtering with the help of a stoplist is required. The most representative English stopword list was compiled by Christopher Fox: https://dl.acm.org/doi/10.1145/378881.378888 . Here is the extended list (we added 5 items), which comprises 426 stopwords:
a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing, backs, be, because, become, becomes, became, been, before, began, behind, being, beings, best, better, between, big, both, but, by, c, came, can, cannot, case, cases, certain, certainly, clear, clearly, come, could, d, did, didn, differ, different, differently, do, does, don, done, down, downed, downing, downs, during, e, each, early, either, end, ended, ending, ends, enough, even, evenly, ever, every, everybody, everyone, everything, everywhere, f, face, faces, fact, facts, far, felt, few, find, finds, first, for, four, from, full, fully, further, furthered, furthering, furthers, g, gave, general, generally, get, gets, give, given, gives, go, going, good, goods, got, great, greater, greatest, group, grouped, grouping, groups, h, had, has, have, having, he, her, herself, here, high, higher, highest, him, himself, his, how, however, i, if, important, in, interest, interested, interesting, interests, into, is, it, its, itself, j, just, k, keep, keeps, kind, knew, know, known, knows, l, large, largely, last, later, latest, least, less, let, lets, like, likely, long, longer, longest, m, made, make, making, man, many, may, me, member, members, men, might, more, most, mostly, me, mr, mrs, much, must, my, myself, n, necessary, need, needed, needing, needs, never, new, newer, newest, next, no, non, not, nobody, noone, nothing, now, nowhere, number, numbers, o, of, off, often, old, older, oldest, on, once, one, only, open, opened, opening, opens, or, order, ordered, ordering, orders, other, others, our, out, over, p, part, parted, parting, parts, per, perhaps, place, places, point, pointed, pointing, points, possible, present, presented, presenting, presents, problem, problems, put, puts, q, quite, r, re,
rather, really, right, room, rooms, s, said, same, saw, say, says, second, seconds, see, sees, seem, seemed, seeming, seems, several, shall, she, should, show, showed, showing, shows, side, sides, since, small, smaller, smallest, so, some, somebody, someone, something, somewhere, state, states, still, such, sure, t, take, taken, than, that, the, their, them, then, there, therefore, these, they, thing, things, think, thinks, this, those, though, thought, thoughts, three, through, thus, to, today, together, too, took, toward, turn, turned, turning, turns, two, u, under, until, up, upon, us, use, uses, used, v, ve, very, w, want, wanted, wanting, wants, was, way, ways, we, well, wells, went, were, what, when, where, whether, which, while, who, whole, whose, why, will, with, within, without, work, worked, working, works, would, y, year, years, yet, you, young, younger, youngest, your, yours.
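Applying a stoplist to a ranked list of terms can be sketched as follows. The helper names `load_stoplist` and `filter_ranked` are hypothetical, `STOPLIST_TEXT` holds only a short excerpt of the list above, and the ranked weights are invented for the example.

```python
# Short excerpt of the stoplist, in the same comma-separated form as the full list.
STOPLIST_TEXT = "a, about, above, across, after, the, of, and, is, to."

def load_stoplist(text):
    """Parse a comma-separated stoplist into a set of lowercase words."""
    return {w.strip().rstrip(".").lower() for w in text.split(",") if w.strip()}

def filter_ranked(ranked, stoplist):
    """Drop (term, weight) pairs whose term is in the stoplist, keeping the order."""
    return [(term, w) for term, w in ranked if term.lower() not in stoplist]

stoplist = load_stoplist(STOPLIST_TEXT)
# Illustrative ranked output; real weights come from the tool.
ranked = [("ranker", 2.2), ("corpus", 1.1), ("the", 0.0), ("of", 0.0)]
filtered = filter_ranked(ranked, stoplist)
print(filtered)
```

Storing the stoplist as a set keeps each lookup constant-time, which matters little for 426 words but keeps the filter simple.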
System requirements
300 MHz processor
0.5 MB free disk space