Finding the similarity between two words


In our 2011 presentation on Omniopticon, George Shirk and I expanded on some of my earlier work with semantic proximity to attempt to track temporal trends in, among other things, Twitter posts. During the course of our work that year, Twitter changed its "feed", requiring one to authenticate into Twitter in order to see its content. Prior to that, content was published more openly on the Internet, and our approach was predicated on the company's earlier strategy. George graduated and the project has languished since then, periodically undergoing the vagaries of changes to Linux/CGI scripts and the perpetual changes to SVG (as its original scope is restricted by browser developers seeming to lack either the foresight or the expertise in College Algebra that might be needed to maintain a standard that deals with geometry rather than text).

Among the things we set out to look at in 2011 was the degree to which "trends" might share some sort of semantic proximity despite being expressed with different vocabulary. To focus on this, we relied upon the public domain version of Roget's thesaurus (the original version of which I had used for some decades to ascertain semantic proximities for a variety of purposes). Our presentation revealed a technique I had worked with since about 2001 for hopping through the synonyms of a thesaurus, which narrowed down the plethora of synonyms in Roget's (owing to Roget's somewhat idiosyncratic notion of a semantic taxonomy) to a set of "best" synonyms. The methodology had been used earlier in my Semantic Soup, but was not, in our presentation, elucidated in detail.

I'll seek to explain here a bit of what is afoot in these works, in hopes of being able to take it a bit further in coming years.

Since little explanation of the methods involved in either the example above or of Semantic Soup has been provided elsewhere, let me say a few things here.

Each word in Roget's thesaurus has its "primary entry" in the index (which may have been added later). That entry may point to more than one of the 1000 semantic categories allowed under Roget's taxonomy, and from each of those it may point to between one and thirty subcategories (roughly 10 to 100 subcategories per category). As a result, each word may, in fact, have several thousand "synonyms".

The basic reasoning is that if a word has many synonyms, as most in Roget's and Moby's do, we might evaluate the strength of synonymy by observing that two words are stronger synonyms if they have a lot of synonyms in common. That is, they are rather like centers of a synonymy cluster.
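
As a rough illustration of this reasoning, here is a minimal sketch that counts the words two entries have in common. It assumes a comma-delimited file with one primary entry per line followed by its synonyms. The toy file, its contents, and the function names entry_words and shared_count are invented for the example; this is not the script behind Semantic Soup, and the toy scores bear no relation to the real ones.

```shell
#!/bin/sh
# Toy thesaurus: one primary entry per line, followed by its synonyms.
# These lines are invented for illustration, not taken from Roget's or Moby.
toy=$(mktemp)
cat > "$toy" <<'EOF'
ground,floor,earth,soil,base
floor,ground,base,deck,story
sky,heaven,air,blue
EOF

# entry_words WORD: all words on WORD's primary entry line (including WORD),
# one per line and sorted, since comm requires sorted input.
entry_words() {
    grep "^$1," "$toy" | tr ',' '\n' | sort -u
}

# shared_count A B: how many words the entries for A and B have in common.
shared_count() {
    fa=$(mktemp); fb=$(mktemp)
    entry_words "$1" > "$fa"
    entry_words "$2" > "$fb"
    comm -12 "$fa" "$fb" | wc -l | tr -d ' '
    rm -f "$fa" "$fb"
}

shared_count ground floor    # entries share ground, floor and base
shared_count ground ground   # a word compared with itself scores highest
shared_count ground sky      # nothing in common
```

In this sketch a word compared with itself always gets the largest score (the size of its own entry), which is the same pattern as a word scoring higher against itself than against any other word.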

For a view of the current technique, see the example here:
http://granite.sru.edu/~ddailey/cgi/hyphens?actuary

Basically, this technique begins by constructing a list of all words that 'co-occur' with the given term, in this case: 'actuary'. That is, let us find all words which are listed under the same primary headings as 'actuary'.

So, if we look for all primary entries containing 'actuary' among their synonyms, we find the following list of primary entries (preceded by their record numbers):
$ m=~/public_html/moby/mthes/mobythes.txt (a comma-delimited version of the thesaurus file)

$ grep -n actuary $m|sed s/,.*//
171:accountant
259:actuary
935:annuity
1366:assurance
1503:auditor
1716:bail bond
2758:bond
2787:bookkeeper
3374:calculator
4229:clerk
4727:comptroller
4736:computer
5094:controller
5365:CPA
6034:deductible
9259:figurer
13514:insurance company
13515:insurance policy
13516:insurance
16938:mutual company
19750:policy
21415:recorder
21584:registrar
24360:social security
25262:stock company
28043:underwriter

Now, let us consider the combined vocabulary of all words contained in the 'actuary' entry and in those entries which point to 'actuary'. We'll call these the near-synonyms of 'actuary', namely those which co-occur with 'actuary' in a list of synonyms.
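
The combined vocabulary can be sketched as follows: collect every entry line that contains the word, split the lines into words, and count how often each word turns up. The toy file and the function name near_synonyms are assumptions made for the example. Unlike a plain grep for the word, the pattern here only matches the word as a complete comma-separated field.

```shell
#!/bin/sh
# Toy comma-delimited thesaurus (invented data, not from Moby).
toy=$(mktemp)
cat > "$toy" <<'EOF'
ground,floor,earth,base
floor,ground,base,deck
earth,ground,soil,land
sky,heaven,air
EOF

# near_synonyms WORD: every word that co-occurs with WORD on some entry line,
# preceded by the number of such lines it appears on, highest count first.
near_synonyms() {
    grep -E "(^|,)$1(,|\$)" "$toy" |   # lines holding WORD as a whole field
    tr ',' '\n' |                      # one word per line
    sort | uniq -c |                   # count repeated words
    sort -k1,1nr -k2                   # by count (descending), then by word
}

near_synonyms ground
```

The word itself always heads the list, since it appears on every line selected; the words just below it are the candidates for "best" synonyms.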


Union (over all synonyms of actuary)


Note: Higher numbers reflect higher similarity. Similarity of "ground" and "floor" for example is 125. "ground" and "ground" is 381.


----------topic-----------
explaining approach of Semantic Soup Thesaurus




171:accountant
259:actuary
935:annuity
1366:assurance
1503:auditor
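If any entry in the smaller list (entries pointing to 'actuary') were missing from the larger list (words pointed to from 'actuary'), that would prove interesting. Using the doublecount script listed further below:
$ ./doublecount actuary|sort|uniq -c|grep -c 2
26

This shows that the counts are the same (26 entries point to 'actuary', and 26 words are listed twice) -- a situation only possible if each of the to-synonyms is also a from-synonym.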





$ awk 'BEGIN {RS=":"} /actuary/ {print $1}' w=[^[:alpha:]]$1[^[:alpha:]] $f|sort|uniq -c|sort -n
      1 acceptation
      1 annuity
      1 auditor
      1 bond
      1 bookkeeper
      1 calculator
      1 clerk
      1 comptroller
      1 computer
      1 controller
      1 deductible
      1 insurance
      1 policy
      1 recorder
      1 registrar
      1 underwriter
      3 accountant
      7 actuary

--------
//thesaurus part:
d="<input type='button' value='"
e="' onclick='again(this.value)'>"
for i in `./wordsub $1 14|tail -14`
do
echo $d$i$e
count=`expr $count + 1`
done
------------
//wordsub
f=/home/ddailey/public_html/moby/mthes/thesaurus
# $1 is the word; a non-empty third argument keeps the counts in the output.
# Note: ln (the last line for sed to print) is not set anywhere in this listing.
if [ -n "$3" ]
then
awk 'BEGIN {RS=":"} $0 ~ w {print $0}' w=[^[:alpha:]]$1[^[:alpha:]] $f|
sort|uniq -c|sort -nr|sed -n 2,"$ln"p|awk '{print $0}'
else
awk 'BEGIN {RS=":"} $0 ~ w {print $0}' w=[^[:alpha:]]$1[^[:alpha:]] $f|
sort|uniq -c|sort -nr|sed -n 2,"$ln"p|awk '{print $2}'
fi
--------------
//partsub
$ cat ../../cgi/partsub
#! /bin/sh
echo "Content-type: text/html"
echo
f=/home/ddailey/public_html/moby/mthes/thesaurus
echo "<b>"$1":</b><br>"
        awk 'BEGIN {RS=":"} $0 ~ w {print $0}' w=[^[:alpha:]]$1[^[:alpha:]] $f|
        sort|uniq -c


http://granite.sru.edu/~ddailey/cgi/partsub?actuary
$ p=../../cgi/partsub
52 1 abacus 1 abbe 1 academician 1 acceptation 1 accord 13 accountant 1 accouple 1 accumulate 1 acolyte 1 acquiescence 1 action 26 actuary 1 adder 1 adding 1 addition 1 addressee 1 adherence 1 adhesion 1 adhesive 1 adjunct 1 affiliation 1 affinity 1 agent 1 agglomeration 1 agglutinate 1 agglutination 1 aggregation 2 agreement 1 aid 1 alimony 1 allegiance 1 alliance 1 allotment 1 almoner 4 amanuensis 1
...............
5 treasurer 1 treaty 2 troth 1 troupe 2 trust 5 trustee 1 truth 2 understanding 1 undertaking 1 underwrite 14 underwriter 1 unification 1 unify 1 union 1 unite 1 utility 1 validation 1 verger 1 verification 1 viewer 1 vinculum 3 visitor 1 vow 3 warrant 3 warranty 1 way 1 weld 2 welfare 1 woods 1 woodwind 1 word 1 writer 1 yoke

 $p actuary|wc
    603    1202   10136
 $p actuary|sort -n|tail -14
     11 controller
     11 recorder
     11 registrar
     13 accountant
     13 bookkeeper
     13 calculator
     14 annuity
     14 bond
     14 deductible
     14 insurance
     14 policy
     14 underwriter
     26 actuary

output of  hyphens?actuary:
Xactuary<br>
</td></tr><tr><td colspan=2 align=center>
relatives<br>
<input type='button' value='underwriter' onclick='again(this.value)'>
<input type='button' value='policy' onclick='again(this.value)'>
<input type='button' value='insurance' onclick='again(this.value)'>
<input type='button' value='deductible' onclick='again(this.value)'>
<input type='button' value='bond' onclick='again(this.value)'>
<input type='button' value='annuity' onclick='again(this.value)'>
<br>
<input type='button' value='calculator' onclick='again(this.value)'>
<input type='button' value='bookkeeper' onclick='again(this.value)'>
$ time $p actuary|sort -n|tail
     13 bookkeeper
     13 calculator
     14 annuity
     14 bond
     14 deductible
     14 insurance
     14 policy
     14 underwriter
     26 actuary
     52

real    0m1.068s
user    0m1.041s
sys     0m0.018s


$ time ./synhys actuary|sort -nk2
figurer 5
computer 6
underwriter 7
registrar 9
auditor 10
comptroller 10
controller 10
CPA 10
clerk 11
recorder 11
bookkeeper 12
calculator 12
accountant 13
annuity 13
bail-bond 13
insurance-company 13
insurance-policy 13
mutual-company 13
social-security 13
assurance 14
bond 14
deductible 14
policy 14
stock-company 14
insurance 15

real    0m1.113s
user    0m0.848s
sys     0m0.211s
./synhys actuary|sort -nk2|wc
     25      50     334

$ ../../cgi/partsub actuary|sort -n|wc
    603    1202   10136

$ time ../../cgi/partsub actuary|sort -n|tail -16
     11 clerk
     11 comptroller
     11 controller
     11 recorder
     11 registrar
     13 accountant
     13 bookkeeper
     13 calculator
     14 annuity
     14 bond
     14 deductible
     14 insurance
     14 policy
     14 underwriter
     26 actuary
     52

real    0m0.851s
user    0m0.832s
sys     0m0.017s
--------------------




---------topic ------------
methodological diversion/explanation

w=/home/ddailey/public_html/words -- a reasonably good list of 36,000 words (few capital letters, hyphens, apostrophes, etc)
m=/home/ddailey/public_html/moby/mthes/mobythes.txt -- the original comma delimited Moby Thesaurus

It was noted that very many of the "synonyms" in Moby are phrases (two or more words) written without hyphens. Some of my previous analyses treated each word in such a phrase as a possible synonym, leading to certain problems. Therefore, the following file was created, in which all such phrases were rewritten as hyphenations.
For example, 'a priori' becomes 'a-priori'. This is simply to keep the phrase together (as a string of characters) for the sake of processing. Also, whenever "real hyphenations" occurred, they were replaced by double hyphens: "--". So, as examples:

'absent-minded' becomes 'absent--minded' and
'down-to-earth' becomes 'down--to--earth'
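
How hyph-thes was actually produced is not shown here, but the rewriting rule just described can be sketched with two sed substitutions. The function name hyphenate_phrases is made up for the example. The order matters: real hyphens are doubled first, so that the single hyphens introduced for spaces are not doubled afterwards.

```shell
#!/bin/sh
# hyphenate_phrases: filter a comma-delimited thesaurus on standard input.
#   1. every existing hyphen becomes a double hyphen
#   2. every space inside a phrase becomes a single hyphen
hyphenate_phrases() {
    sed -e 's/-/--/g' -e 's/ /-/g'
}

printf '%s\n' 'a priori,absent-minded,down-to-earth' | hyphenate_phrases
# prints: a-priori,absent--minded,down--to--earth
```

Since commas are left untouched, the output keeps the one-entry-per-line, comma-delimited layout of the input.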

h=/home/ddailey/public_html/moby/mthes/hyph-thes

$ awk 'BEGIN {FS = "," } {print $1}' $h|grep -e --|wc
    194     194    2901

This indicates that there were only 194 primary entries having hyphenations, something borne out in the original Moby file by
$ awk 'BEGIN {FS = "," } {print $1}' $m|grep -e -|wc
    194     199    2663




And these two "restructured versions" of the Moby thesaurus (methodologies to be described later)
s=/home/ddailey/public_html/moby/mthes/super25
b=/home/ddailey/public_html/moby/mthes/bestFroms




------------topic-------------
Alternative approaches: considering outbound versus inbound connections.

There are 26 entries and they've been listed by line number and first word (the primary entry).
That is 26 primary entries point to 'actuary'

If, on the other hand, we look directly at the primary entry 'actuary', we see there are 71 'words' pointed to from 'actuary' (the count includes 'actuary' itself, as the first field of its own entry):
$ sed -n 259p $m|tr , '\n'|wc
     71     124    1055
$ sed -n 259p $m
actuary,CA,CPA,abacist,accident insurance,accountant,accountant general,annuity,assurance,auditor,aviation insurance,bail bond,bank accountant,bank examiner,bond,bookkeeper,business life insurance,calculator,casualty insurance,certificate of insurance,certified public accountant,chartered accountant,clerk,comptroller,computer,controller,cost accountant,cost keeper,court bond,credit insurance,credit life insurance,deductible,endowment insurance,estimator,family maintenance policy,fidelity bond,fidelity insurance,figurer,flood insurance,fraternal insurance,government insurance,health insurance,industrial life insurance,insurance,insurance agent,insurance broker,insurance company,insurance man,insurance policy,interinsurance,journalizer,liability insurance,license bond,limited payment insurance,major medical insurance,malpractice insurance,marine insurance,mutual company,ocean marine insurance,permit bond,policy,reckoner,recorder,registrar,robbery insurance,social security,statistician,stock company,term insurance,theft insurance,underwriter
sed -n /^actuary/p $m|tr , '\n' puts them on separate lines:
$ sed -n /^actuary/p $m|tr , '\n'
actuary
CA
CPA
abacist
accident insurance
accountant
accountant general
annuity
assurance
auditor
aviation insurance
....
Suppose we do both
$ grep actuary $m|sed s/,.*//
and
$ sed -n /^actuary/p $m|tr , '\n'
on the same word and then sort the output and count frequencies
$ cat doublecount
m=~/public_html/moby/mthes/mobythes.txt
sed -n /^$1/p $m|tr , '\n'
grep $1 $m|sed s/,.*//

We would expect that those words being listed twice would be exactly those contained in the smaller listing (headings pointing to a word):
$ ./doublecount actuary|sort|uniq -c
      1 abacist
      1 accident insurance
      2 accountant
      1 accountant general
      2 actuary
      2 annuity
      2 assurance
      2 auditor

However, words pointed to which do not point back may be of interest:
$ ./doublecount deduce|wc
    184     251    1612
$ grep deduce $m|sed s/,.*//|wc
     82     103     678
$ ./doublecount deduce|sort|uniq -c|grep 2|wc
     74     168    1197
$ sed -n /^deduce/p $m|tr , '\n'|wc
    102     148     934

$  ./doublecount deduce|sort|uniq -c|wc
    110     267    1887

There are 6 words that point to deduce that are not pointed to from deduce:

$ cat inbound
h=/home/ddailey/public_html/moby/mthes/hyph-thes
comm -23 <(grep ,$1, $h|sed s/,.*//|sort) <(grep ^$1, $h|tr , "\n"|sort)

$ ./inbound deduce
dope-out
estimate
guess
perceive
rule
summon
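
The inbound script keeps only the first column of comm. A sketch of the full three-way split (inbound only, outbound only, and both directions) is given below. It uses temporary files instead of process substitution so that it runs under plain sh. The toy file and the function names inbound_list, outbound_list and direction are assumptions made for the example.

```shell
#!/bin/sh
# Toy comma-delimited thesaurus (invented data, not from Moby).
toy=$(mktemp)
cat > "$toy" <<'EOF'
deduce,infer,conclude,reason
infer,deduce,conclude
guess,deduce,suppose
conclude,end,finish
EOF

# inbound_list WORD: primary entries that list WORD among their synonyms.
inbound_list() {
    grep -E ",$1(,|\$)" "$toy" | sed 's/,.*//' | sort -u
}

# outbound_list WORD: words listed under the primary entry WORD
# (sed 1d drops the entry word itself).
outbound_list() {
    grep "^$1," "$toy" | tr ',' '\n' | sed 1d | sort -u
}

# direction WORD: label every related word by the direction of its link.
direction() {
    inb=$(mktemp); outb=$(mktemp)
    inbound_list "$1" > "$inb"
    outbound_list "$1" > "$outb"
    comm -23 "$inb" "$outb" | sed 's/^/inbound-only /'    # points to WORD only
    comm -13 "$inb" "$outb" | sed 's/^/outbound-only /'   # pointed to by WORD only
    comm -12 "$inb" "$outb" | sed 's/^/both /'            # reciprocated
    rm -f "$inb" "$outb"
}

direction deduce
```

On the toy data, 'guess' points to 'deduce' without being listed under it, 'conclude' and 'reason' are listed under 'deduce' without pointing back, and 'infer' is linked in both directions.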


----------topic------------
directionality

[ddailey@granite mthes]$ grep ^ill, $h|tr , "\n"|wc
    168     168    1715
[ddailey@granite mthes]$ grep ill, $h|sed s/,.*//|wc
   3649    3649   30449
[ddailey@granite mthes]$ grep ,ill, $h|sed s/,.*//|wc
    121     121    1071



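There are 29 words that deduce points to that are not reciprocated (the list includes 'deduce' itself, since the pattern ,deduce, does not match the start of its own entry):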
$ comm -13 <(grep ,deduce, $h|sed s/,.*//|sort) <(grep ^deduce, $h|tr , "\n"|sort)
apply-reason
be-afraid
cogitate
deduce
draw-a-conclusion
draw-an-inference
draw-forth
fancy
get-from
logicalize
logicize
make
philosophize
presurmise
provide-a-rationale
provisionally-accept
read-into
reason-that
regard
summon-forth
syllogize
take-as-proved
take-to-be
take-to-mean
use-reason
wangle-out-of
winkle-out
work-out
worm-out

Using super25 instead of h=/home/ddailey/public_html/moby/mthes/hyph-thes
$ cat superinb
s=/home/ddailey/public_html/moby/mthes/super25
comm -23 <(grep ,$1, $s|sed s/:.*//|sort) <(grep ^$1: $s|tr , "\n"|sort)

$ ./superinb deduce
apply-reason
draw-a-conclusion
draw-an-inference
draw-forth
educe
elicit
entertain-a-theory
entertain-ideas
espouse-a-theory
exercise-the-mind
form-ideas
generalize
have-a-theory
hypothesize
intellectualize
judge
logicalize
logicize
philosophize
provide-a-rationale
reason-out
reason-that
syllogize
take-as-proved
theorize
use-reason
wangle-out-of
winkle-out
worm-out
worm-out-of
[ddailey@granite mthes]$ ./superinb deduce|wc
     30      30     380




Then there are 73 for which the connection is bidirectional (the count of 74 reported by wc below includes 'deduce' itself):
analyze
arouse
assume
believe
bring-forth
bring-out
...
take-for-granted
take-it
theorize
think
understand
wangle
worm-out-of
[ddailey@granite mthes]$ comm -12 <(grep deduce $h|sed s/,.*//|sort) <(grep ^deduce, $h|tr , "\n"|sort)|wc
     74      74     605


Applying inbound to the 24 words in randwords:
$ time for i in `cat randwords`; do echo $i------------; k=`./inbound $i`; echo $k|wc -w; echo $k; done
amelioration------------
4
about-face euphemism flip-flop re-creation
beverage------------
26
bumper draft dram draught draughtsman draughty drench gulp guzzle lap libation nip peg portion potion pull quaff sip slurp snort suck sup swig swill thirst-quencher tot
blot------------
9
discredit disgrace dishonor dry-up eradicate extirpate libel notoriety percolate
congress------------
10
associate body consortium gather germ-cell get-together house negotiation sit-in spoke
cruelly------------
3
badly roughly severely
dishonourably------------
0
dwarfish------------
59
abstemious ascetic austere dumpy dwarf dwarfed elfin exiguous frugal impoverished incipient jejune lean Lenten Lilliputian limited meager mean midget miserly narrow niggardly paltry parsimonious peewee pocket poor puny pygmy rudimentary runty scant scanty scraggy scrawny scrubby short shriveled shrunken skimpy slender slight slim small spare sparing Spartan squat starvation stingy stinted straitened stunted subsistence thin tiny undersized watery wizened
filamentiferous------------
0
happiness------------
4
fat-of-the-land merriment mirth triumph
heathendom------------
5
allotheism animism heathenism idolatry paganism
ignoramus------------
12
barbarian dolt duffer gawk know-nothing mug mutt philistine poke prune silly spoon
ill------------
23
diabolic diseased drawback fly-in-the-ointment frail ghastly hard harmful hostile in-a-bad-way infirm mischievous off-color peaky poor queasy ropy rotten snake-in-the-grass spite spleen unsound vice
ill-tempered------------
0
indoctrinate------------
2
educate tutor
manure------------
9
dress enrich fatten fecundate fertilize filth fructify impregnate till
mistrustful------------
3
incredulous jealous unbelieving
mustachio------------
0
quinsy------------
1
sniffles
realist------------
4
adherent philosopher pragmatical stylist
succumb------------
9
cave-in grin-and-bear-it live-with pop-off put-up-with soften submissive swallow truckle
teacake------------
0
theoretically------------
3
ideally on-paper principle
toboggan------------
13
coast flit fly glide sail skid skim sled sleigh slide slip slither sweep
venison------------
0

real    0m2.485s
user    0m1.864s
sys     0m0.574s

using super25 instead:
$ time for i in `cat randwords`; do echo $i------------; k=`./superinb $i`; echo $k|wc -w; echo $k; done
amelioration------------
5
amendment change-of-heart recovery restoration upbeat
beverage------------
21
alcoholic-drink aqua-vitae booze drink drinkable firewater fluid-extract fluid-mechanics frosted-shake grog hooch hydraulics hydrogeology juice latex liquid-extract malt potable soda-pop soda-water soft-drink
blot------------
71
adsorb asperse baboon bad--mouth bedabble besmirch bespatter black-eye black-mark blackwash blot-up blue--penciling censure chemisorb chemosorb darken defame dele deletion denigrate dinge disparagement dysphemize ebonize efface effacement engage-in-personalities expose-to-infamy expunction eyesore filter-in gargoyle hang-in-effigy heap-dirt-upon look-a-fright look-a-mess look-bad look-like-hell look-something-terrible maculation macule melanize muckrake nigrify no-beauty obliteration offend-the-eye osmose percolate-in reprimand scandalize scratch-out seep-in slander slurp-up smouch soak-in soil soilage soilure sorb sponge-out sully swill-up throw-mud-at torrefy uglify vilify washing-out wipe wiping-out
congress------------
62
advisory-council Areopagus assignation asymptote audience cabinet city-council cohabitation collision-course colloquy committee common-council concentralization confab confabulation conference contact converging conversation converse convocation council-fire council-of-state council-of-war curia divan exchange-of-views focalization get--together interaction intercommunication intercommunion interplay interview judicatory judicature judiciary kitchen-cabinet legislative-body legislature mutual-approach narrowing-gap negotiations panel parley parliament privy-council round-table shadow-cabinet social-activity social-intercourse social-relations soviet speaking speech spokes symposium talking the-Inquisition tribunal truck turnout
cruelly------------
36
austerely badly crudely dangerously dejectedly despondently devilishly diabolically distressfully ferally fiendishly grimly grossly harshly heartlessly inartistically inclemently inhumanely inhumanly mercilessly pitilessly remorselessly ruthlessly sharkishly slaveringly stringently truculently unacceptably uncompassionately unforgivingly unfortunately unhumanly unluckily unmercifully unremorsefully unsympathetically
dishonourably------------
0

dwarfish------------
13
bitsy elfin homunculus Lilliputian midget nanoid pocket--size runty scraggy Tom-Thumb undersize undersized wee
filamentiferous------------
0

happiness------------
57
beatification beatitude becomingness bed-of-roses bewitchment blessedness blitheness bright-outlook bright-side cheeriness cheery-vein clover contentedness delectation Easy-Street ecstatics Elysium entire-satisfaction eupeptic-mien fleshpots gladsomeness good-humor good-spirits gracious-life gracious-living lap-of-luxury Leibnizian-optimism life-of-ease loaves-and-fishes luxury millennialism optimism optimisticalness overhappiness overjoyfulness paradise peace-of-mind perfectibilism philosophical-optimism Pollyannaism prosperity prosperousness rare-good-humor ravishment rosy-expectation rosy-outlook sanguine-humor sanguineness silver-lining solid-comfort sunniness sunny-side sunshine the-affluent-life the-good-life thriving-condition unalloyed-happiness
heathendom------------
1
paganry
ignoramus------------
34
alphabetarian ament articled-clerk boeotian born-fool congenital-idiot cretin dabbler dummkopf egregious-ass figure-of-fun golem half--wit idiot illiterati jackass juggins middlebrow mongoloid-idiot mooncalf moron natural--born-fool natural-idiot new-boy nonprofessional no-scholar perfect-fool schmuck simpleton smatterer softhead stupid-ass tomfool unintelligentsia
ill------------
47
abomination ailing apocalyptic at-a-reduction badly befoulment below-par besmirchment boding critically-ill crying-evil detriment dirtying evilly feeling-awful feeling-something-terrible ill--boding ill--fated ill--starred in-danger infection inhospitable mortally-ill not-quite-right of-evil-portent out-of-sorts poorly portending ritual-uncleanness sickish sick-unto-death soiling taken-ill the-worst toxin unbenign unbenignant under-the-weather ungracious unhealthfulness unhealthy unhospitable unkind unkindly unneighborly unwell venom
ill-tempered------------
0

indoctrinate------------
3
counterindoctrinate reindoctrinate win-away
manure------------
22
ammonia castor--bean-meal commercial-fertilizer compost cross--fertilize cross--pollen cross--pollinate cross--pollinize enrichener fecundate fecundify fertilizer get-with-child get-with-young nitrogen organic-fertilizer phosphate pollinate pollinize prolificate spermatize superphosphate
mistrustful------------
16
disinclined-to-believe disposed-to-doubt green--eyed green-with-jealousy horn--mad Humean impervious-to-persuasion inconvincible jaundice--eyed sceptical shy-of-belief skeptic unconvincible uncredulous unpersuasible unwilling-to-accept
mustachio------------
0

quinsy------------
31
acute-bronchitis aluminosis amygdalitis anthracosilicosis anthracosis asbestosis Asiatic-flu atypical-pneumonia bituminosis black-lung bronchial-pneumonia bronchiectasis bronchiolitis bronchopneumonia chalicosis chronic-bronchitis collapsed-lung common-cold coniosis coryza croup croupous-pneumonia double-pneumonia dry-pleurisy emphysema empyema epidemic-pleurodynia fibrinous-pneumonia hay-fever Hong-Kong-flu lung-cancer
realist------------
99
actionist action-painter animistic atomistic bandman commonsense Comtist cosmotheistic cubist Cynic Dadaist divisionist down--to--earth empirical empiricist Eretrian essentialist eudaemonistic existential expressionist Fauvist Fichtean free-abstractionist free-expressionist futurist geometricist gesture-calligraphist gesturist hedonic hedonistic Herbartian humanistic hylomorphist hylomorphous hylozoist informalist instrumentalist instrumentist intuitionist Kierkegaardian logical-empiricist magic-realist Marxian master-of-style materialist matter-informalist mentalist metaphysical monist naturalist neoconcretist neoconstructivist neocubist neodadaist neoexpressionist neoimpressionist neoplasticist neotraditionalist nonobjectivist nonrepresentationist nuagist objectivist observed ontologist organicist organic-mechanist panlogistical pantheistic poetic-realist poetic-tachist postconcretist postexpressionist practical--minded preimpressionist rationalistic representationist ripieno Scholastic scientistic sensationalist sensist Socratist Sophist sound--thinking spatialist Spinozist straight--thinking substantialist surrealist syncretist syncretistic synthesist tachist theistic transcendentalistic unsentimental vitalist vitalistic voluntaristic
succumb------------
61
accede acquiesce be-agreeable be-believed be-done-for be-lost be-persuaded be-superseded be-swallowed be-unseen bite-the-dust capitulate cease-to-live come-to-naught come-to-nothing crap-out depart-this-life drop-dead face-the-music fall-asleep fall-down-dead fall-into-disuse fall-senseless find-credence flame-out get-bogged-down get-hung-up get-mired get-tired go-dead go-kaput go-under gray-out grow-weary have-enough knock-under knuckle-down knuckle-under-to live-with-it lose-out lose-the-day not-resist obey pack-up pass-current play-out puff-and-blow put-off-mortality quit-this-world relent return-to-dust say-uncle shrug-it-off sputter-and-stop stop-breathing succumb-to swallow-it swallow-the-pill take-the-count up-and-die yield-the-ghost
teacake------------
0

theoretically------------
0

toboggan------------
15
bobsleigh dogsled double--ripper landslip luge pung range-the-coast sail-coast--wise Skimobile skirt-the-shore slidder Sno--Cat snowmobile stay-in-soundings zip-through
venison------------
2
interconnection viande

real    0m2.197s
user    0m1.513s
sys     0m0.668s



$ sed -n /^deduce/p $h|tr , '\n'|wc
    102     148     934

$ for i in `sed -n /^deduce/p $h|tr , '\n'`; do sed -n /$i,/p $h|awk 'BEGIN {FS=","} /i/{print $1}' i=$i; done|sort|uniq -c|wc
   7452   14904  123877

$ time for i in `sed -n /^deduce/p $h|tr , '\n'`; do sed -n /$i,/p $h|awk 'BEGIN {FS=","} /i/{print $1}' i=$i; done|sort|uniq -c|sort -n

     30 arouse
     30 call-out
     30 call-up
     30 drag-out
     30 rouse
     30 stimulate
     30 summon-up
     30 wangle
     30 worm-out-of
     31 find
     31 get
     31 summon
     32 bring-forth
     32 draw-out
     32 elicit
     32 guess
     33 educe
     33 evoke
     34 obtain
     34 procure
     35 judge
     35 secure
     36 reason
     39 believe
     39 dream
     39 let-be
     40 fancy
     40 prefigure
     40 presuppose
     40 repute
     40 take-for-granted
     41 feel
     41 grant
     41 let
     41 say
     42 daresay
     42 deem
     42 expect
     42 imagine
     42 opine
     42 take-for
     42 take-it
     43 divine
     43 induce
     43 reckon
     43 suspect
     43 understand
     44 presume
     44 suppose
     44 surmise
     45 consider
     46 assume
     48 derive
     49 conceive
     49 think
     52 take
     57 gather
     58 conclude
     67 infer
    101 deduce
real    0m10.371s
user    0m9.402s
sys     0m0.870s

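Note that in the awk program above, /i/ is a regular expression literal matching any record that contains the letter 'i'; it does not refer to the awk variable i. Matching against a variable requires the ~ operator:
$ time for i in `sed -n /^deduce/p $h|tr , '\n'`; do sed -n /$i,/p $h|awk 'BEGIN {FS=","} $0 ~ lookfor {print $1}' lookfor=$i; done|sort|uniq -c|sort -n

     49 think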
     52 take
     57 gather
     58 conclude
     67 infer
    101 deduce
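
The two timing runs above differ only in the awk pattern: /i/ versus $0 ~ lookfor. The small sketch below shows the difference on three invented records. The function names literal_match and variable_match are made up for the example, and the variable is passed with -v rather than as a trailing assignment.

```shell
#!/bin/sh
# Three toy records; only the second one contains the search word "take".
data='deduce,infer
take,assume
think,opine'

# literal_match WORD: /i/ is a regex literal, so it selects every record
# containing the letter "i", whatever value the variable i holds.
literal_match() {
    awk -F, -v i="$1" '/i/ {print $1}'
}

# variable_match WORD: $0 ~ i uses the value of the variable i as the regex.
variable_match() {
    awk -F, -v i="$1" '$0 ~ i {print $1}'
}

printf '%s\n' "$data" | literal_match take    # prints deduce and think
printf '%s\n' "$data" | variable_match take   # prints take
```

In the loops above the lines have already been filtered by sed, so the awk pattern mostly decides whether that filter is left alone or accidentally narrowed to lines containing an 'i'.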




$ cat randomword
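w=/home/ddailey/public_html/words
# n = number of lines (words) in the list
n=`wc -l $w|awk '{print $1}'`
# r = random line number from 1 to n. srand() is called inside BEGIN: used as
# a pattern, its return value (the previous seed) may be 0, so nothing prints.
r=`awk -v n=$n 'BEGIN {srand(); print int(rand() * n) + 1}'`
sed -n "$r"p $w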



$ ./randomword
drowsiness
$ cat randwords
amelioration
beverage
blot
congress
cruelly
dishonourably
dwarfish
filamentiferous
happiness
heathendom
ignoramus
ill
ill-tempered
indoctrinate
manure
mistrustful
mustachio
quinsy
realist
succumb
teacake
theoretically
toboggan
venison
[ddailey@granite mthes]$ wc randwords
 24  24 235 randwords


-----------topic-----------
forming a set of random words for testing run-time of various techniques


$ w=~/public_html/words
[ddailey@granite mthes]$ shuf $w -n 5
palatinate
trillion
evocation
citizenship
credulously
$ shuf $h -n 100|sed s/,.*//>ranh100
[ddailey@granite mthes]$ more ranh100
propitiate
dispense
conduit
triangle
disseminated
fee-simple




$ time for i in `cat randwords`;do echo $i `../../cgi/partsub $i|sort -n|tail -16`;done
amelioration 49 continuity 49 renewal 49 reversal 49 revolution 49 upheaval 50 break 50 shift 50 turn 51 change 53 reform 72 betterment 72 improvement 72 melioration 72 revival 73 amelioration 146
beverage 27 quaff 27 snort 28 draft 28 guzzle 28 pull 28 sip 28 sup 28 swig 28 swill 29 suck 30 drench 31 liquor 46 potation 54 beverage 54 drink 108
blot 47 blur 48 blacken 48 scorch 51 smudge 51 smutch 54 spot 56 smoke 62 smirch 65 brand 65 smear 65 taint 66 stain 66 tarnish 69 sponge 157 blot 314
congress 52 gathering 54 commerce 56 meet 58 conclave 58 session 58 sitting 60 convention 62 connection 62 synod 65 concourse 68 diet 72 assembly 77 meeting 79 council 156 congress 312
cruelly <b>cruelly:</b><br> Content-type: text/html
dishonourably <b>dishonourably:</b><br> Content-type: text/html
dwarfish 40 slender 41 jejune 41 lean 41 mean 41 paltry 41 scrawny 41 spare 42 slight 43 poor 43 puny 43 thin 44 small 56 stunted 57 meager 59 dwarfish 118
filamentiferous <b>filamentiferous:</b><br> Content-type: text/html
happiness 30 well-being 31 bliss 31 comfort 31 ease 32 elation 32 exhilaration 32 gaiety 32 glee 33 exuberance 35 delight 35 joy 36 cheerfulness 39 cheer 57 felicity 72 happiness 144
heathendom 1 weakness 1 worship 1 yearning 2 barbarism 2 dark 2 fetishism 2 monolatry 2 savagery 3 demonolatry 3 idolism 5 animism 5 heathendom 5 heathenism 5 idolatry 5 paganism 10
ignoramus 25 entrant 25 fledgling 25 freshman 25 neophyte 25 novice 25 novitiate 25 probationer 25 recruit 25 rookie 26 abecedarian 27 tyro 30 fool 34 tenderfoot 35 greenhorn 63 ignoramus 126
ill 77 unfavorable 78 dark 78 infelicitous 78 unsuitable 80 unhappy 81 inauspicious 82 inexpedient 83 untoward 86 ill-advised 86 unfortunate 90 evil 105 improper 111 wrong 122 ill 129 bad 986
ill-tempered 15 crabby 15 fractious 15 splenetic 16 crusty 16 irritable 16 waspish 17 crabbed 17 peevish 17 petulant 17 snappish 18 cranky 18 cross 18 testy 22 bad-tempered 24 ill-tempered 48
indoctrinate 12 teach 14 imbue 14 impregnate 14 impress 14 infix 14 infuse 14 inoculate 14 program 15 instill 16 implant 17 brainwash 17 condition 18 catechize 19 inculcate 25 indoctrinate 50
manure 14 coprolite 14 crap 14 defecation 14 feces 14 jakes 14 movement 14 stool 14 turd 15 sewerage 15 shit 16 ordure 16 sewage 21 dung 21 guano 29 manure 58
mistrustful 10 incredulous 16 scrupulous 17 agnostic 17 doubtful 17 dubious 17 uncertain 18 shy 18 skeptical 19 leery 19 questioning 19 suspecting 19 wary 21 distrustful 21 mistrustful 21 suspicious 42
mustachio <b>mustachio:</b><br> Content-type: text/html
quinsy 8 ague 8 hepatitis 8 meningitis 19 asthma 19 bronchitis 19 catarrh 19 cold 19 croup 19 flu 19 grippe 19 influenza 19 laryngitis 19 pneumonia 19 quinsy 19 rheum 38
realist 21 secular 21 theistic 22 empirical 22 mechanistic 22 scholastic 23 humanist 23 naturalistic 23 practical 24 eclectic 24 pragmatist 25 positivist 25 utilitarian 36 pragmatic 37 realistic 42 realist 84
succumb 28 pant 28 resign 29 wilt 30 flag 31 blow 31 die 32 droop 32 expire 32 yield 34 faint 34 go 34 pass 44 sink 48 drop 86 succumb 172
teacake <b>teacake:</b><br> Content-type: text/html
theoretically 1 substructure 1 tenet 1 theorem 1 thoroughly 1 totally 1 truism 1 truth 1 undercarriage 1 usage 1 viewpoint 1 vocation 1 wholly 1 written 3 ideally 3 theoretically 6
toboggan 19 glide 19 glissade 19 sail 19 skate 19 skateboard 19 ski 19 skid 19 skim 19 sled 19 sleigh 19 slide 19 slip 19 slither 19 sweep 19 toboggan 38
venison 2 work 3 tea 4 stew 13 aspic 13 barbecue 13 civet 13 flesh 13 game 13 hash 13 jerky 13 joint 13 meat 13 mince 13 roast 13 venison 26

real    0m13.696s
user    0m13.479s
sys     0m0.191s

part-hyp is maybe easier?
$ cat part-hyp
#! /bin/sh
echo "Content-type: text/html"
echo
echo $1"<br>"
f=/home/ddailey/public_html/moby/mthes/hyph-thes
grep ,$1, $f|tr ',' '\n'|sort|uniq -c|sort -nr|head -25


$ time for i in `cat randwords`; do ./part-hyp $i; done

real    0m2.240s
user    0m1.924s
sys     0m0.252s

$ cat worhyp
f=/home/ddailey/public_html/moby/mthes/hyph-thes
grep ,$1, $f|tr ',' '\n'|sort|uniq -c|sort -nr|head -25|awk '{print $2}'|
awk 'BEGIN {FS="\n"; RS=":"}
{out=$1;for(i=2;i<NF;i++)
{out=out","$i}; print w":"out}' w=$1


$ time for i in `cat randwords`; do ./worhyp $i; done
venison:viande,venison,scrapple,sausage-meat,roast,pot-roast,pemmican,mince,menue-viande,meat,jugged-hare,joint,jerky,hash,hachis,game,forcemeat,flesh,civet,bouilli,boiled-meat,barbecue,aspic,stew,tea

real    0m2.398s
user    0m2.038s
sys     0m0.324s


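Comparison of the two scripts for each test word. Each block below gives: the word; the output of ./synhys WORD|sort -nk2|tail -16 with its timing; then the output of ../../cgi/partsub WORD|sort -n|tail -16 with its timing. The first block is for amelioration.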
amelioration remaking 47
transition 47
turnabout 47
variety 47
reversal 48
revolution 48
switch 48
upheaval 48
renewal 49
shift 50
break 51
turn 52
reform 53
betterment 71
improvement 71
revival 72
real    0m4.646s
user    0m4.045s
sys     0m0.535s
     49 continuity
     49 renewal
     49 reversal
     49 revolution
     49 upheaval
     50 break
     50 shift
     50 turn
     51 change
     53 reform
     72 betterment
     72 improvement
     72 melioration
     72 revival
     73 amelioration
real    0m1.019s
user    0m0.994s
sys     0m0.023s
beverage
peg 27
portion 27
sip 27
snort 27
sup 27
swig 27
draft 28
swill 28
drench 29
pull 29
suck 29
water 30
liquor 33
potation 45
beverage 54
drink 54
real    0m6.125s
user    0m5.670s
sys     0m0.410s
27 quaff
     27 snort
     28 draft
     28 guzzle
     28 pull
     28 sip
     28 sup
     28 swig
     28 swill
     29 suck
     30 drench
     31 liquor
     46 potation
     54 beverage
     54 drink
real    0m1.047s
user    0m1.022s
sys     0m0.019s
blot sponge 91
rub 92
blemish 105
stigmatize 105
freckle 106
speck 108
splotch 108
blotch 121
scratch 126
stain 128
brand 129
cut 134
mark 134
spot 134
off 172
blot 212
real    0m18.693s
user    0m17.216s
sys     0m1.427s
     47 blur
     48 blacken
     48 scorch
     51 smudge
     51 smutch
     54 spot
     56 smoke
     62 smirch
     65 brand
     65 smear
     65 taint
     66 stain
     66 tarnish
     69 sponge
    157 blot

real    0m1.132s
user    0m1.105s
sys     0m0.023s
congress seance 57
sitting 57
meet 58
convention 60
exchange 60
session 60
intercourse 61
concourse 64
connection 65
party 65
synod 66
diet 68
assembly 73
council 83
meeting 86
congress 156
real    0m11.503s
user    0m9.291s
sys     0m1.000s
 52 gathering
     54 commerce
     56 meet
     58 conclave
     58 session
     58 sitting
     60 convention
     62 connection
     62 synod
     65 concourse
     68 diet
     72 assembly
     77 meeting
     79 council
    156 congress
    312

real    0m0.888s
user    0m0.860s
sys     0m0.019s
cruelly intolerably 27
openly 27
sadly 27
sorely 27
unashamedly 27
unduly 27
deathly 28
flagrantly 28
miserably 28
terribly 28
grievously 29
horribly 29
improperly 29
awfully 30
dreadfully 30
painfully 31

real    0m1.268s
user    0m0.921s
sys     0m0.340s
nothing
dishonourably nothing nothing
dwarfish mean 40
narrow 40
paltry 40
scant 40
scrawny 40
straitened 40
lean 41
spare 41
puny 42
slight 42
poor 43
thin 44
small 46
dwarfed 54
stunted 55
meager 56

real    0m4.394s
user    0m3.916s
sys     0m0.462s
40 slender
     41 jejune
     41 lean
     41 mean
     41 paltry
     41 scrawny
     41 spare
     42 slight
     43 poor
     43 puny
     43 thin
     44 small
     56 stunted
     57 meager
     59 dwarfish
    118

real    0m1.058s
user    0m1.037s
sys     0m0.018s
filamentiferous nothing nothing
happiness high-spirits 34
cheerfulness 35
transport 36
delight 39
elation 39
joy 41
comfort 43
pleasure 43
cheer 45
cloud-nine 48
paradise 48
seventh-heaven 48
heaven 49
felicity 57
high 58
happiness 98

real    0m5.130s
user    0m4.330s
sys     0m0.753s
30 well-being
     31 bliss
     31 comfort
     31 ease
     32 elation
     32 exhilaration
     32 gaiety
     32 glee
     33 exuberance
     35 delight
     35 joy
     36 cheerfulness
     39 cheer
     57 felicity
     72 happiness
    144

real    0m1.059s
user    0m1.046s
sys     0m0.011s
heathendom allotheism 4
animism 4
heathenism 4
idolatry 4
paganism 4

real    0m0.324s
user    0m0.246s
sys     0m0.076s
 1 weakness
      1 worship
      1 yearning
      2 barbarism
      2 dark
      2 fetishism
      2 monolatry
      2 savagery
      3 demonolatry
      3 idolism
      5 animism
      5 heathendom
      5 heathenism
      5 idolatry
      5 paganism
     10


real    0m0.835s
user    0m0.816s
sys     0m0.017s
ignoramus
catechumen 24
debutant 24
entrant 24
fledgling 24
freshman 24
neophyte 24
novice 24
probationer 24
raw-recruit 24
rookie 24
abecedarian 25
recruit 25
greeny 28
tenderfoot 33
fool 34
greenhorn 34

real    0m4.506s
user    0m4.054s
sys     0m0.437s
25 entrant
     25 fledgling
     25 freshman
     25 neophyte
     25 novice
     25 novitiate
     25 probationer
     25 recruit
     25 rookie
     26 abecedarian
     27 tyro
     30 fool
     34 tenderfoot
     35 greenhorn
     63 ignoramus
    126


real    0m0.917s
user    0m0.900s
sys     0m0.015s
ill rank 406
blow 455
mean 460
strong 460
rough 463
close 478
short 487
mind 585
hard 658
black 681
bad 685
get 894
high 979
ill 1028
take 1409
down 1784
real    0m54.622s
user    0m48.095s
sys     0m6.248s
77 unfavorable
     78 dark
     78 infelicitous
     78 unsuitable
     80 unhappy
     81 inauspicious
     82 inexpedient
     83 untoward
     86 ill-advised
     86 unfortunate
     90 evil
    105 improper
    111 wrong
    122 ill
    129 bad
    986
real    0m1.117s
user    0m1.104s
sys     0m0.009s
ill-tempered  time ./synhys "ill--tempered"|sort -nk2|tail -16
sullen 8
querulous 9
surly 9
disagreeable 10
mean 10
ill--natured 11
perverse 11
sour 11
ugly 12
ill--humored 13
cantankerous 14
irritable 15
peevish 16
petulant 16
snappish 16
testy 17
real    0m1.713s
user    0m1.485s
sys     0m0.222s
     15 crabby
     15 fractious
     15 splenetic
     16 crusty
     16 irritable
     16 waspish
     17 crabbed
     17 peevish
     17 petulant
     17 snappish
     18 cranky
     18 cross
     18 testy
     22 bad-tempered
     24 ill-tempered
     48
real    0m1.048s
user    0m1.019s
sys     0m0.024s
indoctrinate drill 11
school 11
teach 12
imbue 13
impregnate 13
infix 13
infuse 13
inoculate 13
impress 14
instill 14
program 14
brainwash 16
implant 16
catechize 17
condition 17
inculcate 18

real    0m1.815s
user    0m1.628s
sys     0m0.179s

     12 teach
     14 imbue
     14 impregnate
     14 impress
     14 infix
     14 infuse
     14 inoculate
     14 program
     15 instill
     16 implant
     17 brainwash
     17 condition
     18 catechize
     19 inculcate
     25 indoctrinate
     50

real    0m0.545s
user    0m0.536s
sys     0m0.007s
manure dress 9
dressing 9
compost 10
muck 10
BM 13
bowel-movement 13
crap 13
feces 13
feculence 14
movement 14
stool 14
droppings 15
ordure 15
sewage 15
shit 15
dung 20


real    0m1.776s
user    0m1.573s
sys     0m0.197s
 14 coprolite
     14 crap
     14 defecation
     14 feces
     14 jakes
     14 movement
     14 stool
     14 turd
     15 sewerage
     15 shit
     16 ordure
     16 sewage
     21 dung
     21 guano
     29 manure
     58

real    0m1.048s
user    0m1.028s
sys     0m0.018s
mistrustful
incredulous 9
wary 12
scrupulous 15
untrusting 15
agnostic 16
doubting 18
leery 18
suspecting 18
questioning 19
distrustful 20
shy 20
suspicious 20
dubious 21
uncertain 21
doubtful 22
skeptical 22
real    0m0.913s
user    0m0.780s
sys     0m0.129s

     10 incredulous
     16 scrupulous
     17 agnostic
     17 doubtful
     17 dubious
     17 uncertain
     18 shy
     18 skeptical
     19 leery
     19 questioning
     19 suspecting
     19 wary
     21 distrustful
     21 mistrustful
     21 suspicious
     42

real    0m0.537s
user    0m0.533s
sys     0m0.003s
mustachio
nothing nothing
quinsy nothingnothing
realist unidealistic 24
reasonable 25
sane 25
rational 28
sensible 30
materialistic 32
utilitarian 32
positivistic 37
sound 37
idealistic 39
pragmatic 39
practical 41
realist 46
unromantic 65
naturalistic 86
realistic 136

real    0m2.116s
user    0m1.776s
sys     0m0.310s
     21 secular
     21 theistic
     22 empirical
     22 mechanistic
     22 scholastic
     23 humanist
     23 naturalistic
     23 practical
     24 eclectic
     24 pragmatist
     25 positivist
     25 utilitarian
     36 pragmatic
     37 realistic
     42 realist
     84

real    0m0.583s
user    0m0.548s
sys     0m0.011s
succumb faint 34
die 35
give-out 35
peter-out 37
submit 37
blow 38
collapse 40
run-out 41
pass 45
sink 47
take 49
come 52
drop 52
fall 54
go 80
succumb 92

real    0m10.702s
user    0m10.041s
sys     0m0.619s
     28 pant
     28 resign
     29 wilt
     30 flag
     31 blow
     31 die
     32 droop
     32 expire
     32 yield
     34 faint
     34 go
     34 pass
     44 sink
     48 drop
     86 succumb
    172

real    0m0.597s
user    0m0.587s
sys     0m0.009s
teacake none none
theoretically on-paper 1
principle 1
ideally 2
real    0m0.185s
user    0m0.157s
sys     0m0.027s
      1 substructure
      1 tenet
      1 theorem
      1 thoroughly
      1 totally
      1 truism
      1 truth
      1 undercarriage
      1 usage
      1 viewpoint
      1 vocation
      1 wholly
      1 written
      3 ideally
      3 theoretically
      6


real    0m0.529s
user    0m0.521s
sys     0m0.006s
toboggan
flit 19
skid 19
sled 19
sleigh 19
coast 20
fly 20
sail 20
slip 20
glide 23
slither 23
skim 24
slide 24
sweep 25


real    0m1.182s
user    0m1.099s
sys     0m0.080s
     19 glide
     19 glissade
     19 sail
     19 skate
     19 skateboard
     19 ski
     19 skid
     19 skim
     19 sled
     19 sleigh
     19 slide
     19 slip
     19 slither
     19 sweep
     19 toboggan
     38

real    0m0.539s
user    0m0.532s
sys     0m0.007s
venison aspic 12
barbecue 12
forcemeat 12
scrapple 12
flesh 13
game 13
hash 13
jerky 13
joint 13
meat 13
mince 13
roast 13

real    0m0.820s
user    0m0.720s
sys     0m0.097s
      2 work
      3 tea
      4 stew
     13 aspic
     13 barbecue
     13 civet
     13 flesh
     13 game
     13 hash
     13 jerky
     13 joint
     13 meat
     13 mince
     13 roast
     13 venison
     26

real    0m0.548s
user    0m0.539s
sys     0m0.008s



$ time ./synhys actuary|sort -nk2|tail -15
bookkeeper 12
calculator 12
accountant 13
annuity 13
bail-bond 13
insurance-company 13
insurance-policy 13
mutual-company 13
social-security 13
assurance 14
bond 14
deductible 14
policy 14
stock-company 14
insurance 15

real    0m1.186s
user    0m0.931s
sys     0m0.250s




---------------topic-----------------


Rough counts of the number of occurrences of words (from the list 'words') within mobythes.txt (m=/home/ddailey/public_html/moby/mthes/thesaurus).

Let's take an example.
$ wc $m
   30260   645505 24823017 mobythes.txt

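There are 30260 primary entries in Moby Thesaurus. The wc word count works out to about 21 whitespace-delimited tokens per entry (645505/30260), but since the synonyms are separated by commas rather than spaces, this is not the number of "synonyms" per entry; the per-entry field counts further below range from 18 to 1448.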
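A per-entry average can also be taken over comma-separated fields rather than over wc's whitespace-separated tokens, so that "adding machine" counts once. A minimal sketch, assuming one entry per line with the headword as the first field; the function name mean_synonyms and the two-line sample are made up for illustration.

```sh
#!/bin/sh
# mean_synonyms FILE
# Average number of synonyms per entry, where an entry is one line of
# comma-separated fields and the first field is the headword.
mean_synonyms() (
    awk -F, '
        NF  { total += NF - 1; entries++ }   # skip empty lines
        END { if (entries) printf "%.1f\n", total / entries }
    ' "$1"
)

# Hypothetical two-entry sample: 2 and 4 synonyms respectively.
sample=$(mktemp)
cat > "$sample" <<'EOF'
adding machine,abacus,calculator
frieze,abacus,acanthus,annulet,bell
EOF

mean_synonyms "$sample"
rm -f "$sample"
```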

(w=/home/ddailey/public_html/words)

$ time for i in `head -20 $w`; do echo $i `grep -c [^[:alpha:]]$i[^[:alpha:]] $m`; done
a 4977
aardvark 13
aback 60
abacus 4
abaft 3
abalienate 31
abalienation 23
abandon 253
abandoned 254
abandonment 73
abase 32
abased 1
abasement 31
abash 46
abashed 49
abate 188
abatement 85
abattoir 4
abba 15
abbacy 9

real    0m3.118s
user    0m2.962s
sys     0m0.114s
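The loop above runs grep once per word over the whole file. A single pass in awk is one possible alternative. The sketch below is an assumption-laden illustration (the function name count_entries and the sample files are invented here), and it counts entries in which the word is a complete comma-separated field, so its numbers will differ from the grep counts above, which also count a word found inside a longer phrase.

```sh
#!/bin/sh
# count_entries WORDLIST THESAURUS
# For every word in WORDLIST (one per line), print "word count", where
# count is the number of THESAURUS lines holding the word as a whole field.
count_entries() (
    awk -F, '
        NR == FNR { want[$0] = 1; count[$0] = 0; order[++n] = $0; next }
        {
            split("", seen)                  # reset the per-line set
            for (i = 1; i <= NF; i++)
                if (($i in want) && !($i in seen)) { seen[$i] = 1; count[$i]++ }
        }
        END { for (k = 1; k <= n; k++) print order[k], count[order[k]] }
    ' "$1" "$2"
)

# Hypothetical sample data.
words=$(mktemp); thes=$(mktemp)
printf '%s\n' abacus calculator zebra > "$words"
cat > "$thes" <<'EOF'
adding machine,abacus,calculator
calculator,abacus,accountant
frieze,abacus,acanthus
EOF

count_entries "$words" "$thes"
rm -f "$words" "$thes"
```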

--------------topic--------------
handling hyphens

$ cat $m|sed 's/-/--/g'>double-mthes
$ cat double-mthes|sed 's/\ /-/g'>hyph-thes
Since mobythes.txt contains many multi-word and hyphenated "words" (as either primary entries or secondary synonyms), the resource has been rewritten with each hyphen replaced by a double hyphen and then each space (as in "adding machine") replaced by a single hyphen, to ease processing by "for i in `unixcommand`" constructs.


Consider one of the simpler words, 'abacus'.
It appears in four of the listings:
$ sed -n /abacus/= mobythes.txt
281
3374
10064
27078

$ grep -n abacus $m|sed 's/,/,\ /g'
281:adding machine, Comptometer, abacus, adding, analog computer, arithmograph, arithmometer, calculating machine, calculator, cash register, computation, counter, difference engine, digital computer, electronic computer, listing machine, pari-mutuel machine, quipu, rule, slide rule, sliding scale, suan pan, tabulator
3374:calculator, CA, CPA, Comptometer, abacist, abacus, accountant, accountant general, actuary, adding, adding machine, analog computer, arithmograph, arithmometer, auditor, bank accountant, bank examiner, bookkeeper, calculating machine, cash register, certified public accountant, chartered accountant, clerk, comptroller, computation, computer, controller, cost accountant, cost keeper, counter, difference engine, digital computer, electronic computer, estimator, figurer, intriguer, journalizer, listing machine, machinator, maneuverer, manipulator, pari-mutuel machine, quipu, reckoner, recorder, registrar, rule, schemer, slide rule, sliding scale, statistician, strategist, suan pan, tabulator, tactician, wire-puller
10064:frieze, abacus, acanthus, annulet, apophyge, architrave, astragal, beading, beak, bell, billet, boss, capstone, cartouche, cavetto, cima, cinquefoil, conge, console, coping, corbel, cornice, crown, cusp, cyma, dentil, drip, drop, fascia, fillet, finial, foil, fret, frontispiece, head, header, headpiece, keystone, lintel, listel, molding, ovolo, patera, pendant, quatrefoil, reed, sconce, splay, taenia, terminal, torus, treenail, trefoil, tympanum, volute
27078:topping, Olympian, a cut above, abacus, above, acanthus, aerial, ahead, airy, altitudinous, annulet, architrave, ascendant, ascending, aspiring, astragal, bell, better, capping, capstone, chosen, cima, colossal, console, consummating, coping, corbel, cornice, crown, crowning, culminating, cyma, dentil, distinguished, dominating, drip, drop, eclipsing, elevated, eminent, ethereal, exalted, exceeding, excellent, excelling, finer, finial, first-chop, first-class, first-rate, frieze, frontispiece, frosting, greater, haughty, head, header, heading, headpiece, high, high-pitched, high-reaching, high-set, high-up, higher, icing, in ascendancy, in the ascendant, keystone, lintel, lofty, major, marked, monumental, mounting, of choice, one up on, outstanding, outtopping, over, overlooking, overtopping, prominent, rare, rivaling, sconce, soaring, spiring, steep, sublime, super, superior, superlative, supernal, surpassing, taenia, tip-top, top-drawer, top-hole, top-notch, topflight, topless, toplofty, tops, towering, towery, transcendent, transcendental, transcending, treenail, tympanum, uplifted, upper, upreared

The headings of those four listings are:
$ grep -n abacus $m|sed s/,.*//
281:adding machine
3374:calculator
10064:frieze
27078:topping

$ cat $m|sed 's/-/--/g'>double-mthes
This doubles the hyphens so that the original hyphens are preserved; next we'll turn the spaces in phrases into hyphens, so that individual phrases may be dealt with in for i in `awk` constructions without the spaces causing trouble.

$ wc $m double-mthes
   30260   645505 24823017 /home/ddailey/public_html/moby/mthes/mobythes.txt
   30260   645505 24905488 double-mthes
   60520  1291010 49728505 total



$ cat double-mthes|sed 's/\ /-/g'>hyph-thes
$ h=/home/ddailey/public_html/moby/mthes/hyph-thes

$ wc $m double-mthes $h
   30260   645505 24823017 /home/ddailey/public_html/moby/mthes/mobythes.txt
   30260   645505 24905488 double-mthes
   30260    30260 24905488 hyph-thes
   90780  1321270 74633993 total

$ head -2 $m
a cappella,abbandono,accrescendo,affettuoso,agilmente,agitato,amabile,amoroso,appassionatamente,appassionato,brillante,capriccioso,con affetto,con agilita,con agitazione,con amore,crescendo,decrescendo,diminuendo,dolce,forte,fortissimo,lamentabile,leggiero,morendo,parlando,pianissimo,piano,pizzicato,scherzando,scherzoso,sordo,sotto voce,spiccato,staccato,stretto,tremolando,tremoloso,trillando
a la mode,advanced,avant-garde,chic,contemporary,dashing,exclusive,far out,fashionable,fashionably,forward-looking,in,in the mode,mod,modern,modernistic,modernized,modish,modishly,newfashioned,now,present-day,present-time,progressive,soigne,soignee,streamlined,stylish,stylishly,tony,trendy,twentieth-century,ultra-ultra,ultramodern,up-to-date,up-to-datish,up-to-the-minute,vogue,voguish,way out
[ddailey@granite mthes]$ head -2 double-mthes
a cappella,abbandono,accrescendo,affettuoso,agilmente,agitato,amabile,amoroso,appassionatamente,appassionato,brillante,capriccioso,con affetto,con agilita,con agitazione,con amore,crescendo,decrescendo,diminuendo,dolce,forte,fortissimo,lamentabile,leggiero,morendo,parlando,pianissimo,piano,pizzicato,scherzando,scherzoso,sordo,sotto voce,spiccato,staccato,stretto,tremolando,tremoloso,trillando
a la mode,advanced,avant--garde,chic,contemporary,dashing,exclusive,far out,fashionable,fashionably,forward--looking,in,in the mode,mod,modern,modernistic,modernized,modish,modishly,newfashioned,now,present--day,present--time,progressive,soigne,soignee,streamlined,stylish,stylishly,tony,trendy,twentieth--century,ultra--ultra,ultramodern,up--to--date,up--to--datish,up--to--the--minute,vogue,voguish,way out
$ head -2 hyph-thes
a-cappella,abbandono,accrescendo,affettuoso,agilmente,agitato,amabile,amoroso,appassionatamente,appassionato,brillante,capriccioso,con-affetto,con-agilita,con-agitazione,con-amore,crescendo,decrescendo,diminuendo,dolce,forte,fortissimo,lamentabile,leggiero,morendo,parlando,pianissimo,piano,pizzicato,scherzando,scherzoso,sordo,sotto-voce,spiccato,staccato,stretto,tremolando,tremoloso,trillando
a-la-mode,advanced,avant--garde,chic,contemporary,dashing,exclusive,far-out,fashionable,fashionably,forward--looking,in,in-the-mode,mod,modern,modernistic,modernized,modish,modishly,newfashioned,now,present--day,present--time,progressive,soigne,soignee,streamlined,stylish,stylishly,tony,trendy,twentieth--century,ultra--ultra,ultramodern,up--to--date,up--to--datish,up--to--the--minute,vogue,voguish,way-out

$ grep -n abacus $m|sed s/,.*//
281:adding machine
3374:calculator
10064:frieze
27078:topping
$ grep -n abacus $h|sed s/,.*//
281:adding-machine
3374:calculator
10064:frieze
27078:topping

$ cat syns
m=/home/ddailey/public_html/moby/mthes/mobythes.txt
for i in `grep $1 $m|sed 's/,.*//'`;
do echo $i `grep $i $m|grep -c $1`;
done


-----------------topic----------------
Number of synonyms per word

$ cat synhys
m=/home/ddailey/public_html/moby/mthes/hyph-thes
for i in `grep $1 $m|sed 's/,.*//'`;
do echo $i `grep $i $m|grep -c $1`;
done

$ ./synhys abacus
adding-machine 2
calculator 2
frieze 2
topping 1



$ cat synhys
m=/home/ddailey/public_html/moby/mthes/hyph-thes
for i in `grep [^[:alpha:]]$1[^[:alpha:]] $m|sed 's/,.*//'`;
do echo $i `grep [^[:alpha:]]$i[^[:alpha:]] $m|grep -c $1`;
done

This is a problem for words (like 'ill') that occur in so many entries:
$ grep [^[:alpha:]]ill[^[:alpha:]] $m|sed 's/,.*//'|wc
   1028    1028    9657

This count includes hyphenated forms like ill-mannered, because the hyphen is itself a non-alphabetic character.
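One way around the hyphen problem is to compare whole comma-separated fields instead of using a non-alphabetic boundary. The following is a sketch under that assumption, not the synhys script itself; the name heads_with and the three sample entries are invented for illustration.

```sh
#!/bin/sh
# heads_with WORD FILE
# Print the headword (first field) of every entry that lists WORD as a
# complete comma-separated field.
heads_with() (
    awk -F, -v w="$1" '
        {
            for (i = 1; i <= NF; i++)
                if (($i "") == (w "")) { print $1; break }   # force string comparison
        }
    ' "$2"
)

# Hypothetical sample in the hyph-thes layout (hyphens already doubled).
sample=$(mktemp)
cat > "$sample" <<'EOF'
ill,bad,sick
bad,ill,evil
rude,ill--mannered,curt
EOF

# Boundary pattern: counts the ill--mannered entry, misses ill at line start.
grep -c '[^[:alpha:]]ill[^[:alpha:]]' "$sample"
# Whole-field match: finds the entries for ill and bad only.
heads_with ill "$sample"
rm -f "$sample"
```

On the full file this would be expected to give a much smaller figure for 'ill' than 1028, though I have not run it there.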






number of entries per primary listing

$ head -10 $h|awk 'BEGIN {FS = ","; RS = "\n"} { print $1, $NF,NF}'
a-cappella trillando 39
a-la-mode way-out 40
a-priori synthetic 29
A--bomb thermonuclear-warhead 19
ab-ovo underlying 48
Abaddon underworld 37
abandon zeal 347
abandoned zealous 297
abase truckle 35
abasement stripping-of-rank 34

Of the 30260 primary entries, the distribution of the number of words per entry looks like this. The left column is the number of primary entries; the right column is the number of comma-separated fields in the entry, headword included (e.g. only three primary entries have just 18 fields, while one primary entry has 1448):
$ cat $h|awk 'BEGIN {FS = ","; RS = "\n"} { print NF}'|sort -n|uniq -c
      3 18
    465 19
    495 20
    429 21
    439 22
    421 23
    445 24
    420 25
    446 26
    386 27
    449 28
    397 29
    421 30
......
      1 953
      1 965
      1 974
      1 987
      1 993
      1 1007
      1 1025
      1 1108
      1 1152
      1 1448
$ cat $h|awk 'BEGIN {FS = ","; RS = "\n"} { print NF, $1}'|sort -n|tail -20 >populars

846 cross
853 head
865 measure
879 charge
882 cast
886 work
890 hold
893 flat
905 close
948 light
953 point
965 pass
974 check
987 line
993 break
1007 color
1025 run
1108 turn
1152 set
1448 cut


Search for the word preceded by a comma or the start of the line (, or ^) and followed by a comma:

$ time for i in `awk '{print $2}' populars`; do echo $i": "`grep -E "(,|^)$i," $h|sed 's/,.*//'|wc -l`;  done
ill: 122
cross: 539
head: 591
measure: 514
charge: 579
cast: 653
work: 451
hold: 594
flat: 564
close: 644
light: 553
point: 631
pass: 632
check: 698
line: 678
break: 696
color: 434
run: 749
turn: 703
set: 919
cut: 1120

real    0m1.467s
user    0m1.287s
sys     0m0.173s

$ paste popular2 populars
ill: 122        001 ill
cross: 539      846 cross
head: 591       853 head
measure: 514    865 measure
charge: 579     879 charge
cast: 653       882 cast
work: 451       886 work
hold: 594       890 hold
flat: 564       893 flat
close: 644      905 close
light: 553      948 light
point: 631      953 point
pass: 632       965 pass
check: 698      974 check
line: 678       987 line
break: 696      993 break
color: 434      1007 color
run: 749        1025 run
turn: 703       1108 turn
set: 919        1152 set
cut: 1120       1448 cut

Consider cases where a word (measure, for example) occurs in one list but not in the other.
In the pasted table, the first list is the number of primary listings in which a target word occurs; the second is the number of words listed as synonyms to the target word.

$ grep -E "(,|^)measure," $h|sed 's/,.*//'|wc -l
514
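Both figures for a word can be gathered in one pass over the file. This is a minimal sketch with assumed names (listing_counts, the four-line sample). Unlike the grep -E "(,|^)$i," pattern, it also counts a word that sits in the last position of an entry, so its first figure may come out slightly higher.

```sh
#!/bin/sh
# listing_counts WORD FILE
# Print "WORD listed=<entries containing WORD> own=<fields in WORD's own entry>".
# "own" counts the headword too, as NF does in the populars list.
listing_counts() (
    awk -F, -v w="$1" '
        ($1 "") == (w "") { own = NF }
        {
            for (i = 1; i <= NF; i++)
                if (($i "") == (w "")) { listed++; break }
        }
        END { printf "%s listed=%d own=%d\n", w, listed, own }
    ' "$2"
)

# Hypothetical sample; measure ends the third entry, where the
# comma-anchored grep pattern cannot see it.
sample=$(mktemp)
cat > "$sample" <<'EOF'
measure,gauge,rule,size
gauge,measure,meter
rule,law,measure
size,bulk,extent
EOF

listing_counts measure "$sample"
grep -cE "(,|^)measure," "$sample"
rm -f "$sample"
```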







$ cat synhys
m=/home/ddailey/public_html/moby/mthes/hyph-thes
for i in `grep [^[:alpha:]]$1[^[:alpha:]] $m|sed 's/,.*//'`;
do echo $i `grep [^[:alpha:]]$i[^[:alpha:]] $m|grep -c $1`;
done


$ ./synhys topping
above 30
aerial 27
ahead 26
airy 29
ascendant 30
ascending 28
aspiring 28
better 29
capital 14
capping 31
chosen 28
close 34
closure 16
colossal 26
completion 22
conclusion 53
consummation 50
crowning 12
culminating 9
culmination 54
distinguished 34
elevated 32
eminent 51
end 80
ending 61
ethereal 28
exalted 28
excellent 34
finer 25
finish 54
finishing 22
first-class 0
first-rate 0
frosting 2
fulfillment 7
greater 26
haughty 27
heading 4
high-pitched 0
high 53
higher 27
icing 2
jumping 4
leaping 7
lofty 29
major 29
marked 27
maturation 14
maturity 14
monumental 26
mounting 28
neat 3
Olympian 27
outstanding 33
over 67
perfection 20
prancing 4
prominent 33
rare 29
realization 7
rivaling 25
soaring 27
steep 26
steeplechase 4
striking 9
sublime 29
super 29
superior 38
superlative 37
supernal 27
surpassing 27
termination 59
terminus 57
tip-top 0
top-notch 0
topflight 8
topless 26
toplofty 27
topping-off 14
towering 29
transcendent 26
transcendental 30
transcending 25
uplifted 26
upper 31
vaulting 5
windup 22

     

Finding the vocabulary of the thesaurus
$ head -5 $h|tr ',' '\n'|sort|uniq -c
      1 abbandono
      1 ab-initio
      1 A--bomb
      1 aborigine
      1 ab-ovo
      1 a-cappella
      1 accrescendo
      1 advanced
      1 affettuoso
      1 a-fortiori
      1 afresh
      1 again
      1 agilmente
      1 agitato
      1 a-la-mode
      1 amabile
      1 amoroso
      1 analytic
      1 anew
      1 a-posteriori
      1 appassionatamente
      1 appassionato
      1 a-priori
      1 as-new
      1 at-first
      1 atomic-bomb
      1 atomic-warhead
      1 at-the-start
      1 avant--garde
      1 back
      1 backward
      1 basal
      1 basic
      1 before-everything
....
      1 trendy
      1 trillando
      1 twentieth--century
      1 ultramodern
      1 ultra--ultra
      1 underlying
      1 up--to--date
      1 up--to--datish
      1 up--to--the--minute
      1 vogue
      1 voguish
      1 way-out


$ head -5 $h|tr ',' '\n'|sort|uniq -c|wc
    175     350    3263

$ cat $h|tr ',' '\n'|sort|uniq -c>thesvocfreq

$ wc thesvocfreq
 103306  206611 2012671 thesvocfreq

103306 separate "words"
$ grep -c "-" thesvocfreq
41047

$ cat thesvocfreq| awk '{print $2}'>thesvocab
[ddailey@granite mthes]$ wc thesvocab
 103306  103305 1186223 thesvocab

[ddailey@granite mthes]$ time for i in `head -100 thesvocab`; do ./worhyp $i; done

real    0m7.934s
user    0m6.437s
sys     0m1.291s

$ time for i in `sed -n 100p  thesvocab`; do ./worhyp $i; done
abiding:abiding,staying,steadfast,stable,permanent,enduring,constant,remaining,persistent,continuing,lasting,immutable,perpetual,unfading,durable,fixed,firm,unfailing,unchanging,unvarying,sustained,steady,solid,unchanged,unchangeable

real    0m0.258s
user    0m0.228s
sys     0m0.029s
$ for i in `sed -n 1001,2000p  thesvocab`; do ./worhyp $i; done>>super25
[ddailey@granite mthes]$ wc super25
  1989   1989 526100 super25
[ddailey@granite mthes]$ for i in `sed -n 2001,5000p  thesvocab`; do ./worhyp $i; done>>super25
$ wc super25
   4963    4963 1319000 super25
at-an-end:wiped-out,washed-up,finished,done,dead,at-an-end,zapped,wound-up,through-with,through,terminated,SOL,shot,settled,set-at-rest,perfected,over,kaput,fini,extinct,expunged,ended,done-with,done-for,deleted
at-an-impasse:stumped,stuck,perplexed,nonplussed,mystified,bewildered,baffled,at-a-standstill,at-a-stand,at-a-nonplus,at-an-impasse,at-a-loss,puzzled,confounded,thrown,on-tenterhooks,muddled,licked,in-suspense,in-a-dilemma,fuddled,floored,dazed,buffaloed,beat
[ddailey@granite mthes]$ for i in `sed -n 5001,10000p  thesvocab`; do ./worhyp $i; done>>super25
bolt-down:wolf-down,wolf,stuff,raven,live-to-eat,guzzle,guttle,gulp-down,gulp,gormandize,gorge,gobble,gluttonize,engorge,devour,cram,bolt-down,bolt,batten,glut,ingurgitate,swill,slop,jam,drench
[ddailey@granite mthes]$ wc super25
   9932    9932 2597279 super25
chokey:jug,hoosegow,chokey,can,stir,quod,prison,lockup,calaboose,slammer,pokey,cooler,clink,pen,coop,jail,incarcerate,immure,constrain,confine,penitentiary,keep,brig,bastille,training-school
$ for i in `sed -n 10001,15000p  thesvocab`; do ./worhyp $i; done>>super25
[ddailey@granite mthes]$ wc super25
  14905   14905 3854586 super25
$ for i in `sed -n 15001,20000p  thesvocab`; do ./worhyp $i; done>>super25
crane:crane,lifter,pothook,spit,gridiron,crook,turnspit,trivet,tripod,tongs,salamander,poker,griller,grill,griddle,grid,grating,grate,fire-tongs,fire-hook,firedog,damper,coal-tongs,chain,andiron
[ddailey@granite mthes]$ wc super25
  19881   19881 5184717 super25
$ for i in `sed -n 20001,25000p  thesvocab`; do ./worhyp $i; done>>super25
District-of-Columbia:VA,USIA,TVA,Tennessee-Valley-Authority,Smithsonian-Institution,SEC,Railroad-Retirement-Board,Panama-Canal-Company,OEO,NSF,NLRB,National-Security-Council,National-Science-Foundation,National-Mediation-Board,NASA,Library-of-Congress,Interstate-Commerce-Commission,Indian-Claims-Commission,ICC,GSA,Government-Printing-Office,General-Services-Administration,General-Accounting-Office,GAO,FTC
[ddailey@granite mthes]$ wc super25
  24846   24846 6481936 super25
$ for i in `sed -n 25001,30000p  thesvocab`; do ./worhyp $i; done>>super25
excursively:to-one-side,sidewise,sideways,sidelong,sideling,on-one-side,obliquely,indirectly,excursively,divergently,divagationally,digressively,deviously,deviately,circuitously,at-an-angle,windward,weather,slantwise,slantways,skirting,sidling,sideway,sidewards,sideward
[ddailey@granite mthes]$ wc super25
  29808   29808 7779202 super25

$ for i in `sed -n 30001,35000p  thesvocab`; do ./worhyp $i; done>>super25
fruit-compote:yield,work,vintage,stone-fruit,second-crop,returns,profits,production,product,produce,proceeds,output,manufacture,make,income,harvest,furnish,fruit-soup,fruit-compote,fruit-cocktail,fruit,fructify,effect,drupe,crop
[ddailey@granite mthes]$ wc super25
  34745   34745 9061576 super25
$ for i in `sed -n 35001,40000p  thesvocab`; do ./worhyp $i; done>>super25
harvester:planter,haymaker,harvester,cultivator,yeoman,truck-farmer,tree-farmer,tiller,tenant-farmer,tea--planter,sower,sharecropper,rustic,reaper,ranchman,rancher,raiser,plowman,plowboy,picker,peasant-holder,peasant,muzhik,kulak,kolkhoznik
[ddailey@granite mthes]$ wc super25
   39673    39673 10322501 super25
$ cp super25 hold25
[ddailey@granite mthes]$ wc *25
   39673    39673 10322501 hold25
   39673    39673 10322501 super25
   79346    79346 20645002 total
$ for i in `sed -n 40001,50000p  thesvocab`; do ./worhyp $i; done>>super25
largeheartedly:profusely,openhandedly,liberally,largeheartedly,handsomely,greatheartedly,generously,bigheartedly,without-stint,with-open-hands,with-both-hands,unstintingly,unsparingly,unselfishly,ungrudgingly,openheartedly,lavishly,hospitably,graciously,freely,freeheartedly,freehandedly,bountifully,bounteously,abundantly
[ddailey@granite mthes]$ wc super25
   49573    49573 12949589 super25
$ for i in `sed -n 50001,60000p  thesvocab`; do ./worhyp $i; done>>super25
nonexistence:nonexistence,nonoccurrence,deprivation,absence,void,vacuity,nullity,nihility,emptiness,vacuum,vacancy,nothingness,unreality,unactuality,not--being,nonsubsistence,nonreality,nonentity,nonbeing,negativity,negativeness,negation,blank,inanity,want
[ddailey@granite mthes]$ wc super25
   59454    59454 15550280 super25
$ cp super25 hold25
[ddailey@granite mthes]$ wc hold25
   59454    59454 15550280 hold25
$ for i in `sed -n 60001,70000p  thesvocab`; do ./worhyp $i; done>>super25
prison-ward:X-ray,ward,treatment-room,therapy,surgery,semi--private-room,recovery-room,private-room,prison-ward,pharmacy,operating-room,nursery,maternity-ward,labor-room,laboratory,isolation,intensive-care,hospital-room,fever-ward,examining-room,emergency,dispensary,delivery-room,consultation-room,clinic
[ddailey@granite mthes]$ wc super25
   69380    69380 18200017 super25
$ for i in `sed -n 70001,90000p  thesvocab`; do ./worhyp $i; done>>super25
tercet:triplet,tercet,sextet,septet,octet,octave,note,line,book,verse,strain,sestet,refrain,measure,envoi,chorus,canto,troika,triumvirate,tristich,trio,tetrastich,terzetto,terza-rima,syllable
[ddailey@granite mthes]$ wc super25
   89213    89213 23364189 super25

$ wc thesvocab
 103306  103305 1186223 thesvocab
[ddailey@granite mthes]$ for i in `sed -n 90001,103306p  thesvocab`; do ./worhyp $i; done>>super25
zymochemistry:zymurgy,zymochemistry,zoochemistry,ultramicrochemistry,topochemistry,thermochemistry,theoretical-chemistry,soil-chemistry,radiochemistry,psychobiochemistry,phytochemistry,physicochemistry,physical-chemistry,pharmacochemistry,petrochemistry,pathochemistry,nuclear-chemistry,mineralogical-chemistry,macrochemistry,lithochemistry,inorganic-chemistry,immunochemistry,iatrochemistry,hydrochemistry,geological-chemistry
[ddailey@granite mthes]$ wc super25
  102146   102146 26810107 super25

$ time for i in `head -1000 thesvocab`; do echo $i"---------";grep ,$i, super25|tr ',' '\n'|sort|uniq -c|sort -n|tail -25;done>doubleSyns

real    0m43.034s
user    0m26.505s
sys     0m8.686s
[ddailey@granite mthes]$ more doubleSyns
$100--a--plate-dinner---------
      1 whistle--stopping:whistle--stopping
      8 $100--a--plate-dinner
      8 campaign-dinner

$ d=~/public_html/moby/mthes/doubleSyns
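The colons separating each headword from its list in super25 were not properly dealt with, so some tokens have the form headword:synonym: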
$ grep  : $d|more
      1 whistle--stopping:whistle--stopping
      1 B:B
      1 unanticipatedly:unexpectedly
$ grep -c : $d
15696
$ wc doubleSyns
 2306753  4510198 42414932 doubleSyns

[ddailey@granite mthes]$ head -200 $d
abacus---------
      1 suan-pan:tabulator
      9 abacus
      9 adding
      9 adding-machine
      9 analog-computer
      9 arithmograph
      9 arithmometer
      9 calculating-machine
      9 calculator

[ddailey@granite mthes]$ head -200 $d|awk '{FS = ":"} {print $1}'
abacus---------
      1 suan-pan
      9 abacus
      9 adding
      9 adding-machine
      9 analog-computer
      9 arithmograph
      9 arithmometer
      9 calculating-machine
      9 calculator

$ time awk '{FS = ":"} {print $1}' $d>correctSyns

real    0m2.244s
user    0m2.137s
sys     0m0.079s
$ wc correctSyns doubleSyns
 2306753  4510198 42250537 correctSyns
 2306753  4510198 42414932 doubleSyns
 4613506  9020396 84665469 total
$ grep -c : correctSyns
0
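A small caution on the awk idiom used above: in POSIX awk an assignment to FS made inside a rule is applied from the next record on, so '{FS = ":"} {print $1}' may split the very first line with the default separator in some implementations. That appears harmless for the header line that more shows at the top of doubleSyns, since it contains neither whitespace nor a colon. Setting the separator on the command line avoids the question. A minimal sketch, with before_colon as an illustrative name:

```sh
#!/bin/sh
# before_colon: keep only the text before the first colon on every line.
# -F: sets the field separator before any record is read, so the first
# line is treated the same way as all the others.
before_colon() { awk -F: '{ print $1 }'; }

printf '%s\n' '      1 suan-pan:tabulator' '      9 abacus' | before_colon
```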

$ mv correctSyns doubleSyns

Putting a colon at the beginning of each lexical entry (the header lines, which unlike the count lines do not start with a space):
$ time sed 's/^[[:alpha:][:punct:]]/:&/' $d>correctSyns

real    0m1.476s
user    0m1.385s
sys     0m0.087s
$ wc correctSyns doubleSyns
 2306753  4510198 42353841 correctSyns
 2306753  4510198 42250537 doubleSyns
 4613506  9020396 84604378 total

Note that 42353841 - 42250537 = 103304, and wc reports
103306   103305  1186223 /home/ddailey/public_html/moby/mthes/thesvocab
hence about one : per entry has been added, as expected.
$ mv correctSyns doubleSyns

$ time sed 's/\ *[0-9]*\ //' doubleSyns|tr '\n' ','|tr ':' '\n'>correctSyns

real    0m3.464s
user    0m3.365s
sys     0m0.070s
$ wc correctSyns
  103304   103304 24726257 correctSyns
$ time sed 's/,$//;s/---------[,]*/:/' correctSyns>bestFroms

real    0m0.637s
user    0m0.559s
sys     0m0.069s
$ wc bestFroms
  103304   103304 23708364 bestFroms
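
The reshaping steps above (colon prefix, strip counts, join with commas, split at colons, tidy) can be followed on a tiny sample. This is only a sketch on made-up data; the toy entries and the variable names are assumptions, and the escaped spaces of the original sed expression are written as plain spaces.

```shell
# Made-up miniature of doubleSyns: a header line per entry, then "count word" lines.
toy=$(printf '%s\n' \
  'abacus---------' \
  '      1 suan-pan' \
  '      9 abacus' \
  '      9 adding' \
  'zebra---------' \
  '      2 horse')

# 1. mark each header with a leading ":"   2. drop the counts
# 3. join everything with commas           4. break at each ":" (one entry per line)
# 5. tidy: strip the trailing comma and turn the run of dashes into a single ":"
reshaped=$(printf '%s\n' "$toy" \
  | sed 's/^[[:alpha:][:punct:]]/:&/' \
  | sed 's/ *[0-9]* //' \
  | tr '\n' ',' \
  | tr ':' '\n' \
  | sed 's/,$//;s/---------[,]*/:/')

printf '%s\n' "$reshaped"
```

Two things show up in the sketch. The first output line is empty, because the stream begins with a ":". And a headword that starts with a digit (3--D appears in a vocabulary listing further on) would not match [[:alpha:][:punct:]], so it would get no colon and would be folded into the preceding entry; whether that accounts for any of the line-count differences noted here I have not checked.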

15147 undefined entries (these have no pointers to them, but probably they have pointers from, since they appear)
$ grep :$ bestFroms |wc
  15147   15147  198042
$ grep : bestFroms |wc
 103304  103304 23708364
$ b=/home/ddailey/public_html/moby/mthes/bestFroms
$ head -2000 $b|grep :$ (excerpt)
alembicated:
Alentejo:
alentours:
alerion:
[ddailey@granite mthes]$ head -2000 $b|grep :$|wc
    577     577    7648
[ddailey@granite mthes]$ grep alembicated $b
alembicated:


h=/home/ddailey/public_html/moby/mthes/hyph-thes
$ s=~/public_html/moby/mthes/super25

$ grep alembicated $s
alembicated:worthy,white--haired,well--liked,well--beloved,venerated,venerable,valued,valuable,utter,unspoiled,unrelieved,unqualified,unnatural,unmitigated,unequivocal,undeniable,unconscionable,unbearable,treasured,total,thoroughgoing,thorough,the-veriest,sweets,sweetkins


$ grep :$ bestFroms>undeFroms
[ddailey@granite mthes]$ wc undeFroms
 15147  15147 198042 undeFroms
$ shuf undeFroms -n 100|sed s/,.*//>ranundeF100
[ddailey@granite mthes]$ cat ranundeF100
board-of-aldermen
mathematical-physics
but-good
engineering-chemistry
europium
sand-pile
ripper
graininess


$ for i in `cat ranundeF100`; do grep -c ^$i: $s; done|sort|uniq -c
      3 0
     97 1


Most words (97/100 in ranundeF100) that are undefined in bestFroms are defined as primary entries in super25=$s

Those that aren't:
$ time for i in `cat ranundeF100`; do echo $i;grep ^$i: $s|sed s/:.*//; done|uniq -c|grep 1
      1 stretcher-bearer
      1 for-the-most-part
      1 cross-hatching
real    0m2.445s
user    0m1.589s
sys     0m0.843s
grep stretcher-bearer $h
stretcher-bearer,Aquarius,Ganymede,Hebe,bearer,...
[ddailey@granite mthes]$ grep for-the-most-part $b
for-the-most-part:
[ddailey@granite mthes]$ grep for-the-most-part $s
[ddailey@granite mthes]$ grep for-the-most-part $h
for-the-most-part,a-fortiori,above-all,...

$ time for i in `cat undeFroms`; do grep ^$i: $s; done>undeFromTos

real    3m43.213s
user    2m24.040s
sys     1m17.334s


$ wc undeFromTos
  14221   14221 3836708 undeFromTos

Hence of the  15147  words undefined in bestFroms, most (14221) have primary entries in super25=$s
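
The per-word grep loop above took nearly four minutes. A single awk pass that loads the wanted headwords first might do the same lookup much faster; the following is a sketch on made-up stand-in files, not something that was run against the real undeFroms and super25.

```shell
# Made-up stand-ins for undeFroms ("word:" per line) and super25 ("word:syn,syn,...").
work=$(mktemp -d)
printf '%s\n' 'alembicated:' 'for-the-most-part:' 'europium:' > "$work/undeFroms"
printf '%s\n' \
  'alembicated:worthy,valued' \
  'europium:metal,element' \
  'zebra:horse' > "$work/super25"

# First file: remember each headword. Second file: print the entries whose
# headword (the text before the first ":") was remembered.
found=$(awk -F: 'NR == FNR { want[$1]; next } $1 in want' \
  "$work/undeFroms" "$work/super25")

printf '%s\n' "$found"
rm -r "$work"
```

Here for-the-most-part has no primary entry in the toy super25, so it is simply absent from the output, as the 926 unmatched words would be.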

$ cat undeFromTos|awk 'BEGIN {FS = ":"} {print $1":"}'>undeFromToVoc
[ddailey@granite mthes]$ wc undeFromToVoc
 14221  14221 185167 undeFromToVoc
$ head undeFromToVoc
aa:
AA:
AA-gun:
AA-radar:
aardvark:
aardwolf:
AA-target-rocket:
AB:
abalienation:
abampere:


The idea is this: let's merge in all those definitions that are in $s but are not in $b=bestFroms.
To do this, we'll first remove the placeholders that are empty in $b, creating a temporary OnlyFroms (which should be 14221 lines shorter than $b).
This will be done by concatenating undeFromToVoc after bestFroms, then sorting and removing all duplicated lines.

Then let's concatenate the definitions from undeFromTos onto the end of OnlyFroms and sort the result. This should replace the undefined words with their definitions from $s.


$ cat bestFroms undeFromToVoc|sort|uniq -u>OnlyFroms
$ wc $b undeFromToVoc OnlyFroms
  103304   103304 23708364 /home/ddailey/public_html/moby/mthes/bestFroms
   14221    14221   185167 undeFromToVoc
   89085    89084 23523205 OnlyFroms

$ cat OnlyFroms undeFromTos|sort>FromTos
$ wc FromTos
  103306   103306 27359899 FromTos
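
The merge can be seen in miniature below. This is a sketch with made-up three-line files under the same names; it only illustrates the sort | uniq -u trick and is not the recorded run.

```shell
work=$(mktemp -d)
# Made-up miniatures of the three files.
printf '%s\n' 'aa:' 'abacus:adding,calculator' 'zymurgy:' > "$work/bestFroms"
printf '%s\n' 'aa:' > "$work/undeFromToVoc"
printf '%s\n' 'aa:lava,scoria' > "$work/undeFromTos"

# A placeholder present in both lists appears twice after sorting, so
# uniq -u (keep only unrepeated lines) drops both copies.
cat "$work/bestFroms" "$work/undeFromToVoc" | sort | uniq -u > "$work/OnlyFroms"

# Append the recovered definitions and re-sort.
merged=$(cat "$work/OnlyFroms" "$work/undeFromTos" | sort)

printf '%s\n' "$merged"
rm -r "$work"
```

Note that a line of undeFromToVoc with no exact twin in bestFroms would survive uniq -u and stay in OnlyFroms, which is one way the line counts could drift.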


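Why did it grow by two lines? I'm not quite sure. Probably an extra carriage return somewhere. The extra lines are already present in OnlyFroms, which has 89085 lines rather than the expected 103304 - 14221 = 89083.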
$ cat FromTos |awk 'BEGIN {FS = ","} {print NF}'|sort|uniq -c
    927 1
     10 19
     23 20
     38 21
     55 22
     34 23
     46 24
 102173 25

$ grep ,turnstile, super25|tr ',' '\n'|sort|uniq -c|sort -n
$ grep ^turnstile super25|tr ',' '\n'|tac
(outputs side by side: counts from the first command at left, the super25 entry from the second command at right)
     14 ticker            hatchway
     15 gatepost          lintel
     15 gateway           porch
     15 hatch             portal
     15 hatchway          porte-cochere
     15 portal            postern
     15 porte-cochere     propylaeum
     15 propylaeum        pylon
     15 pylon             scuttle
     15 storm-door        side-door
     15 telltale          stile
     16 threshold         storm-door
     17 gate              threshold
     20 trap              tollgate
     21 tollgate          trap
     21 trap-door         trap-door
     21 turnpike          turnpike
     36 turnstile         turnstile:turnstile
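
For readers who want to see what the counting pipeline does, here is a sketch on a made-up four-entry thesaurus in the super25 layout. The entries are invented; only the pipeline itself is taken from the command above.

```shell
# Made-up miniature of super25: "headword:syn,syn,...".
toy=$(printf '%s\n' \
  'gate:portal,turnstile,tollgate' \
  'tollgate:gate,turnstile,turnpike' \
  'portal:door,gate' \
  'stile:gate,turnstile,tollgate')

# Keep the entries that list "turnstile", split them into one token per line,
# count how often each token occurs, and show the two most frequent.
top=$(printf '%s\n' "$toy" \
  | grep ',turnstile,' \
  | tr ',' '\n' \
  | sort | uniq -c | sort -n \
  | tail -2 | awk '{print $2, $1}')

printf '%s\n' "$top"
```

Words that keep company with the target across many entries float to the bottom of the list. The first synonym of each entry stays fused to its headword (gate:portal here, turnstile:turnstile in the listing above) and is counted as a separate token.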

Involve one or more hyphens

$ sort -n thesvocfreq |head
      1 AA
      1 AA-radar
      1 abased
      1 abasing
      1 abboccato
      1 abductor
      1 Abelard
      1 A-board
      1 about-face
      1 above--board
[ddailey@granite mthes]$ sort -n thesvocfreq |tail
    645 close
    653 cast
    659 work
    678 line
    697 break
    699 check
    715 turn
    750 run
    919 set
   1120 cut
$ sort -n thesvocfreq|sed 's/\([0-9]\)\ .*/\1/'|tail
    645
    653
    659
    678
    697
    699
    715
    750
    919
   1120
$ sort -n thesvocfreq|sed 's/\([0-9]\)\ .*/\1/'|uniq -c|head -20
   8874       1
   6585       2
   5674       3
   4553       4
   3891       5
   3744       6
   3003       7
   3041       8
   3047       9
   2896      10
   2502      11
   2456      12
   2771      13
   2351      14
   2159      15
   2351      16
   1962      17
   1969      18
   1628      19
   1825      20
$ sort -n thesvocfreq|sed 's/\([0-9]\)\ .*/\1/'|uniq -c|tail -20
      2     553
      1     565
      1     579
      1     582
      1     588
      1     591
      1     595
      1     600
      1     631
      1     633
      1     645
      1     653
      1     659
      1     678
      1     697
      1     699
      1     715
      1     750
      1     919
      1    1120
$ head $v|awk '{print $2}'

$100--a--plate-dinner
3--D
a
A
aa
AA
AA-gun
AA-radar
aardvark

-------------
for i in `head $v|awk '{print $2}'`; do ./synhys $i ; done
very time consuming
-------------
leukemia 5
lot 55
master 92
measure 100
missilery 5
moment-of-truth 7
most 35
munitions 10
mushroom-cloud 4
musketry 5
nonpareil 31
number-one 6
oncoming 21
one-and-all 10
onset 25
ordnance 5
origin 30
origination 28
outbreak 22
outset 23
package-deal 57
package 61
paragon 54
principal 38
prodigy 29
psychological-moment 7
psychological-warfare 1
^C
basalt 15
bedrock 15
breccia 15
conglomerate 18
crag 15
gneiss 15
granite 15
lava 16
monolith 16
rubble 15
sandstone 15
saprolite 0
schist 15
scoria 15
scree 15
stone 18
degree 1
gun 4
missile 2
radar 1
rocket 2
six-shooter 0
gun 2
six-shooter 0
radar 1
animal 4
antelope 13
armadillo 12
bat 13
elephant 12
hare 13
horse 13
kangaroo 13
mammal 1
opossum 12
pig 13
platypus 12
rat 13

Six minutes to do ten lines of $v:
$ time for i in `head $v|awk '{print $2}'`; do ./synhys $i ; done|wc
   5170   10340   62154

real    6m12.367s
user    5m37.862s
sys     0m32.183s

Therefore approximately sixty thousand minutes (about one thousand hours) of processing time to do 100,000 lines of $v.
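
The extrapolation can be checked with a line of awk. This sketch simply scales the measured wall-clock time linearly, which assumes every word costs about the same as the first ten; the 100000 figure is the round vocabulary size used in the text.

```shell
# Extrapolate from the timed run: 10 words took 6m12.367s of wall-clock time.
estimate=$(awk 'BEGIN {
  secs_per_word = (6 * 60 + 12.367) / 10
  total = secs_per_word * 100000
  printf "%.0f minutes, %.0f hours", total / 60, total / 3600
}')
echo "$estimate"
```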

$ head $w
a
aardvark
aback
abacus
abaft
abalienate
abalienation
abandon
abandoned
abandonment




Add dictionary definition of abacus

Subtract common words
From BNC use
echo $wf
$ head $wf
the 6187267
of 2941444
and 2682863
a 2126369
in 1812609
to 1620850
it 1089186
is 998389
was 923948
to 917579


../../data/wordstudy/BNCwordfreq63007
Examples:
$ sed = $wf|sed 'N;s/\n/\ /g'|sed -n 120,140p
120 get 69636
121 own 69459
122 does 68725
123 oh 68413
124 last 68063
125 no 67999
126 more 67198
127 going 64163
128 so 64028
129 after 62570
130 us 62350
131 government 62163
132 might 61446
133 same 61402
134 much 60838
135 see 60628
136 yes 60592
137 go 59772
138 make 59664
139 day 58863
140 man 58769

Define common words as top 130:
[ddailey@granite mthes]$ sed -n 1,130p $wf|awk '{print $1}'>common130
[ddailey@granite mthes]$ head common130
the
of
and
a
in
to
it
is
was
to
[ddailey@granite mthes]$ tail common130
own
does
oh
last
no
more
going
so
after
us
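
One way the "subtract common words" step might look, sketched on made-up lists: the file named candidates is hypothetical and the common130 here has four words only.

```shell
work=$(mktemp -d)
# Made-up stand-ins: a few common words and a candidate word list.
printf '%s\n' 'the' 'of' 'so' 'us' > "$work/common130"
printf '%s\n' 'abacus' 'of' 'calculator' 'so' 'soothe' > "$work/candidates"

# -F fixed strings, -x whole-line match, -v invert, -f read patterns from a file:
# drops lines equal to a common word but keeps words that merely contain one.
kept=$(grep -vxFf "$work/common130" "$work/candidates")

printf '%s\n' "$kept"
rm -r "$work"
```

The -x flag matters: without it "soothe" would be discarded because it contains "so".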
------------------
toward finding words that might have synonyms

$ w=~/public_html/words
$ echo $m
m=/home/ddailey/public_html/moby/mthes/mobythes.txt
$ b=~/public_html/data/wordstudy/BNCwordfreq63007
$ t=~/public_html/moby/mthes/thesvocab
$ s=~/public_html/moby/mthes/super25
$ d=~/public_html/moby/mthes/doubleSyns
$ wc $w $m $b $t $s $d
   35916    35916   332173 /home/ddailey/public_html/words
   30260   645505 24823017 /home/ddailey/public_html/moby/mthes/mobythes.txt
   63007   126014   767502 /home/ddailey/public_html/data/wordstudy/BNCwordfreq63007
  103306   103305  1186223 /home/ddailey/public_html/moby/mthes/thesvocab
  102146   102146 26810107 /home/ddailey/public_html/moby/mthes/super25
 2306753  4510198 42414932 /home/ddailey/public_html/moby/mthes/doubleSyns
 2641388  5523084 96333954 total

Consider finding words that are undefined in either thesvocab or super25 and of those, seeing which have definitions in either the other PD thesaurus or in Webster's 1913.
Interesting time comparisons: (awk is best!)

$ time cat super25|grep -o ^.*:|wc
 102146  102146 1273678

real    0m1.767s
user    0m1.706s
sys     0m0.033s
$ time cat super25|sed s/:.*//|wc
 102146  102146 1171532

real    0m0.970s
user    0m0.904s
sys     0m0.032s

$ time cat super25|awk 'BEGIN {FS = ":"} {print $1}'|wc
 102146  102146 1171532

real    0m0.278s
user    0m0.224s
sys     0m0.045s

$ time cat super25|awk 'BEGIN {FS = ":"} {print $1}'>s25voc

real    0m0.218s
user    0m0.167s
sys     0m0.050s

Consider those words in thesvocab that are not defined in s25voc

$ comm -13 s25voc thesvocab|wc
   1160    1159   14691
$ comm -13 s25voc thesvocab>unsyns25
[ddailey@granite mthes]$ wc unsyns25
 1160  1159 14691 unsyns25
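
comm only gives meaningful answers when both files are sorted the same way. A sketch with made-up vocabularies, pinning the collation with LC_ALL=C (an assumption on my part; the recorded run used whatever order the files already had):

```shell
work=$(mktemp -d)
printf '%s\n' 'banzai-attack' 'abacus' 'zebra' > "$work/s25voc"
printf '%s\n' 'zebra' 'banzai' 'abacus' 'banzai-attack' > "$work/thesvocab"

# Sort both inputs in the same collation before comparing.
LC_ALL=C sort "$work/s25voc" > "$work/a"
LC_ALL=C sort "$work/thesvocab" > "$work/b"

# -1 and -3 suppress "only in first" and "in both", leaving lines only in the second.
missing=$(LC_ALL=C comm -13 "$work/a" "$work/b")

printf '%s\n' "$missing"
rm -r "$work"
```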

Most of these involve hyphens
$ grep -c "-" unsyns25
882
Of those, most appear to be problems with hyphenation -- phrases hyphenated in one place and not in another

278 words remain
$ grep -vc "-" unsyns25
278


$ grep -v "-" unsyns25|head

acoustical
airburst
alphabetical
amebic
anabolism
androgyny
arthritis
banzai
barbecued
$ grep -v "-" unsyns25|tail
zoril
zoster
zoysia
zucchetto
zwieback
zygosis
zygospore
zygote
zymotic
zymurgy

$ grep ^banzai super25|wc
      1       1     283
[ddailey@granite mthes]$ grep ^banzai super25
banzai-attack:unprovoked-assault,strike,sortie,shock-tactics,sally,rush,run-at,run-against,push,panzer-warfare,overkill,onslaught,onset,offensive,offense,mugging,megadeath,mass-attack,lightning-war,lightning-attack,infiltration,head--on-attack,gas-attack,frontal-attack,flank-attack
[ddailey@granite mthes]$ grep ^banzai thesvocab
banzai
banzai-attack

$ grep ^zucchetto super25
[ddailey@granite mthes]$ grep zucchetto super25|wc
      7       7    1517


$ grep banzai, hyph-thes
banzai,bugle-call,call-to-arms,call--up,catchword,clarion,clarion-call,conscription,exhortation,go-for-broke,gung-ho,levy,mobilization,muster,rally,rallying-cry,rebel-yell,recruitment,slogan,trumpet-call,war-cry,war-whoop,watchword
$ grep banzai, hyph-thes|wc
      1       1     232

$ grep ^banzai, mobythes.txt
banzai,bugle call,call to arms,call-up,catchword,clarion,clarion call,conscription,exhortation,go for broke,gung ho,levy,mobilization,muster,rally,rallying cry,rebel yell,recruitment,slogan,trumpet call,war cry,war whoop,watchword

[ddailey@granite mthes]$ grep ^banzai, mobythes.txt|wc
      1      13     231
[ddailey@granite mthes]$ grep ,banzai, mobythes.txt|wc
      0       0       0

-----------------topic ---------------
For comparing time and overlap between original and secondary word lists -- part 1 : time

$ cat ssup
ln=10
b=/home/ddailey/public_html/moby/mthes/bestFroms
h=/home/ddailey/public_html/moby/mthes/hyph-thes
m=/home/ddailey/public_html/moby/mthes/mobythes.txt
s=/home/ddailey/public_html/moby/mthes/super25
echo
echo "-------------"$h $1
grep ,$1, $h|tr ',' '\n'|sort|uniq -c|sort -nr|head -$ln|awk '{print $2,$1}'|tr '\n' ','
echo
echo "-------------"$s $1
grep ,$1, $s|tr ',' '\n'|sort|uniq -c|sort -nr|head -$ln|awk '{print $2,$1}'|tr '\n' ','
echo
echo "-------------"$b $1
grep ,$1, $b|tr ',' '\n'|sort|uniq -c|sort -nr|head -$ln|awk '{print $2,$1}'|tr '\n' ','


-------------/home/ddailey/public_html/moby/mthes/hyph-thes propitiate
propitiate 34,appease 23,placate 20,mollify 20,conciliate 20,soothe 18,pacify 18,calm 18,allay 18,tranquilize 16,
-------------/home/ddailey/public_html/moby/mthes/super25 propitiate
propitiate 15,satisfy 11,smooth-over 8,smooth-down 8,smooth 8,appease 8,square-it 7,square 7,soothe 7,repair 7,
-------------/home/ddailey/public_html/moby/mthes/bestFroms propitiate
propitiate 20,satisfy 13,tranquilize 11,soothe 11,smooth-over 11,smooth-down 11,smooth 11,appease 11,square-it 9,square 9,
-------------/home/ddailey/public_html/moby/mthes/hyph-thes dispense
dispense 108,deal-out 73,issue 71,give-out 62,give 60,deal 60,dole-out 57,mete-out 56,administer 55,mete 53,
-------------/home/ddailey/public_html/moby/mthes/super25 dispense
dispense 49,issue 40,give 30,deal 30,give-out 29,deal-out 27,present 26,yield 25,mete-out 25,dole 25,
-------------/home/ddailey/public_html/moby/mthes/bestFroms dispense
dispense 33,issue 29,deal 24,mete-out 22,give-out 22,give 22,dole 20,yield 18,mete 18,deal-out 17,

....
-------------/home/ddailey/public_html/moby/mthes/super25 instigator
instigator 54,maker 38,prime-mover 37,originator 37,mother 37,producer 36,sire 34,precursor 34,introducer 32,master 26,
-------------/home/ddailey/public_html/moby/mthes/bestFroms instigator
instigator 37,producer 20,prime-mover 20,precursor 20,originator 20,mother 20,maker 20,sire 19,introducer 19,catalyst 18,
-------------/home/ddailey/public_html/moby/mthes/hyph-thes authenticate
confirm 52,authenticate 52,validate 51,warrant 50,support 50,certify 50,affirm 49,ratify 48,attest 37,back 36,
-------------/home/ddailey/public_html/moby/mthes/super25 authenticate
authenticate 19,validate 18,support 18,warrant 17,confirm 17,certify 17,affirm 17,ratify 16,attest 14,back 13,
-------------/home/ddailey/public_html/moby/mthes/bestFroms authenticate
warrant 10,validate 10,support 10,confirm 10,authenticate 10,certify 9,back 9,uphold 8,verify 7,ratify 7,
real    0m2.247s
user    0m1.872s
sys     0m0.341s


10 words -- three processes 10 synonyms -- 2.2 seconds

10 words --two processes 20 synonyms --  1.8 seconds

10 words --three  processes 20 synonyms --  3.6 seconds
100 words --three processes 20 synonyms --  19 seconds



-----------------topic ---------------
comparing overlap (part 2) between different approaches ($h $s $b)
comm -23 <(grep ,$1, $h|sed s/,.*//|sort) <(grep ^$1, $h|tr , "\n"|sort)
$ grep ,illumination, $h|tr ',' '\n'|sort|uniq -c|sort -nr|head -$ln|awk '{print $2,$1}'|tr '\n' ',';echo
illumination 98,illustration 50,enlightenment 41,representation 37,image 33,fresco 33,copy 32,tableau 31,stencil 31,picture 31,
$ grep ,illumination, $b|tr ',' '\n'|sort|uniq -c|sort -nr|head -$ln|awk '{print $2,$1}'|tr '\n' ',';echo
illumination 119,enlightenment 59,illustration 48,wash 25,strip-lighting 25,stage-lighting 25,spot-lighting 25,representation 25,radiation 25,irradiation 25,

$ ob=`grep ,illumination, $b|tr ',' '\n'|sort|uniq -c|sort -nr|head -20|awk '{print $2}'|sort`
$ oh=`grep ,illumination, $h|tr ',' '\n'|sort|uniq -c|sort -nr|head -20|awk '{print $2}'|sort`

$ comm -3 <(echo $ob|tr ' ' '\n') <(echo $oh|tr ' ' '\n')
        abstract
        abstraction
        copy
        cyclorama
        daub
        engraving
explanation
        fresco
        image
indirect-lighting
information
instruction
irradiation
light-and-shade
        likeness
        miniature
overhead-lighting
        panorama
        photograph
        picture
radiation
spoon--feeding
spot-lighting
stage-lighting
        stencil
strip-lighting
        tableau
teaching
tutelage
tutorage
        wall-painting
wash

]$ ./oversup illumination
-------------b and s illumination
16
-------------b not s
4
-------------not b but s
4
-------------s and h illumination
4
-------------s not h illumination
16
-------------not s but h illumination
16
[ddailey@granite cgi]$ ./oversup hard
-------------b and s hard
16
-------------b not s
4
-------------not b but s
4
-------------s and h hard
15
-------------s not h hard
5
-------------not s but h hard
5

b not s = not b but s = 20 - (s and b)
so simplify program to do two passes rather than  six!
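
The identity is easy to confirm on two made-up top-5 lists. Note the assumption that both lists have exactly n entries; head -$ln can return fewer when a word has few neighbours, and then the two differences need not be equal.

```shell
work=$(mktemp -d)
# Two made-up top-5 lists, already sorted as comm requires.
printf '%s\n' 'appease' 'calm' 'pacify' 'satisfy' 'soothe' > "$work/b"
printf '%s\n' 'appease' 'mollify' 'pacify' 'placate' 'soothe' > "$work/s"
n=5

both=$(LC_ALL=C comm -12 "$work/b" "$work/s" | wc -l)
only_b=$(LC_ALL=C comm -23 "$work/b" "$work/s" | wc -l)
only_s=$(LC_ALL=C comm -13 "$work/b" "$work/s" | wc -l)

# With equal-length lists, both differences equal n minus the intersection.
echo "both=$((both)) only_b=$((only_b)) only_s=$((only_s)) n-both=$((n - both))"
rm -r "$work"
```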
$ cat oversup
ln=20
h=/home/ddailey/public_html/moby/mthes/hyph-thes
s=/home/ddailey/public_html/moby/mthes/super25
b=/home/ddailey/public_html/moby/mthes/bestFroms
oh=`grep ,$1, $h|tr ',' '\n'|sort|uniq -c|sort -nr|head -$ln|awk '{print $2}'|s$
os=`grep ,$1, $s|tr ',' '\n'|sort|uniq -c|sort -nr|head -$ln|awk '{print $2}'|s$
ob=`grep ,$1, $b|tr ',' '\n'|sort|uniq -c|sort -nr|head -$ln|awk '{print $2}'|s$
bs=`comm -12 <(echo $ob|tr ' ' '\n') <(echo $os|tr ' ' '\n')|wc -l`
sh=`comm -12 <(echo $os|tr ' ' '\n') <(echo $oh|tr ' ' '\n')|wc -l`
bh=`comm -12 <(echo $ob|tr ' ' '\n') <(echo $oh|tr ' ' '\n')|wc -l`

echo -e $1 "\ts^h:\t" $sh "\tb^h:\t" $bh "\tb^s:\t" $bs



$ time for i in `head ../moby/mthes/ranh100`; do ./oversup $i; done
propitiate      s^h:     7      b^h:     8      b^s:     18
dispense        s^h:     14     b^h:     14     b^s:     19
conduit         s^h:     18     b^h:     15     b^s:     17
triangle        s^h:     15     b^h:     15     b^s:     20
disseminated    s^h:     15     b^h:     15     b^s:     20
fee-simple      s^h:     6      b^h:     6      b^s:     20
wholesale       s^h:     7      b^h:     3      b^s:     11
amnesia         s^h:     18     b^h:     18     b^s:     20
instigator      s^h:     17     b^h:     9      b^s:     10
authenticate    s^h:     17     b^h:     12     b^s:     15

real    0m2.743s
user    0m2.069s
sys     0m0.485s

$ time for i in `cat ../moby/mthes/ranh100`; do ./oversup $i; done>comparisonHSB

real    0m22.653s
user    0m16.923s
sys     0m4.171s

H and S  H and B  S and B
      7        8       18  propitiate
     14       14       19  dispense
     18       15       17  conduit
     15       15       20  triangle
     15       15       20  disseminated
      6        6       20  fee-simple
      7        3       11  wholesale
     18       18       20  amnesia
     17        9       10  instigator
     17       12       15  authenticate
      1        1        1  boo-boo
     10        6       12  sight
     18       18       20  swearing
     16       17       19  back-country
     16       19       17  Goshen
     10        6       16  stately
     12       17       14  Maoism
     11        4       11  trial-balloon
      4        2       16  upward
      7        7       13  whirl
     18       18       19  incensed
     13       11       18  punishing
      1        1       15  boot
     18       18       20  plutocrat
     14       14       19  hit-man
     17       14       16  outright
      3        3       18  wiggle
      3        3       17  graphic-arts
     17       19       18  learned-man
     11       11       16  plainness
     17       16       18  clemency
      1        0       19  fey
     18       16       18  imagine
      2        2       20  sated
     18       17       17  parabolic
     15       12       16  contraposition
     12       13       19  wrestle
      2        1       18  aggregate
      6        4       17  impropriety
      6        6       20  bring-before
     18       16       18  unaccomplished
     19       19       20  sellout
     16       14       16  biased
      5       11       11  wagon
     14       13       18  hew
     13       14       17  ensue
      9        9       17  glory
     16       12       14  arena
     11        7       14  sweet
     10        9       11  coordinate
     16       16       20  shrieking
     12       13       16  unquestioned
      7        6       16  sweet-talk
     19       19       19  juiciness
     15       10       14  visitor
      9       11       10  torment
      1        1        1  fine-grained
     16        9       11  grateful
     11        3        7  harmonic
      7        3       11  spectrum
     15        3        7  peaceable
     12       10       15  dodge
      8        8       14  stonecutter
     11       10       18  ban
     18       19       18  inaudible
      2        2       14  consumer
      2        2       18  blameworthy
      3        3       20  evangelistic
     15        6       10  unwavering
     16        9        8  double-agent
      4        4       20  undiscerning
     10        8       18  infliction
     18       17       19  multiplicity
      5        4       16  preacher
     15       14       18  attire
     18       19       19  matriarchal
      3        1       18  cop
     16       15       19  go-before
     14        2        8  bureaucratic
     18       12       10  Frigg
     11        6       12  attest
     16       14       18  ladle
     10        9       17  province
     18       19       18  economizing
     16        5        7  disclosed
     19       20       19  extant
     11       10       17  fork-over
      4        4       16  illumination
     19       19       20  split-second
      1        1       18  holdover
      3        3       16  pertain
      9        6       13  swamp
      8        9       19  expurgate
      3        2       18  foul-play
     18       19       19  turnover
      3        7       16  rite
     11       10       17  fulfillment
     12       12       19  valuate
      0        0       20  deaf-mute
     18       19       19  idol-worship
   1128      988     1593  (column totals)
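
The final row gives column totals of 1128, 988 and 1593 over 100 words, that is, mean overlaps of 11.28 (H and S), 9.88 (H and B) and 15.93 (S and B) out of 20. A sketch of how such totals and means could be pulled from oversup-style output follows; it uses three lines copied from the timed run above, and the field positions assume the tab-and-label layout that oversup prints.

```shell
# Three lines in the format oversup prints (copied from the listing above).
rows=$(printf '%s\n' \
  'propitiate      s^h:     7      b^h:     8      b^s:     18' \
  'dispense        s^h:     14     b^h:     14     b^s:     19' \
  'conduit         s^h:     18     b^h:     15     b^s:     17')

# Fields 3, 5 and 7 hold the three overlap counts; total them and average.
summary=$(printf '%s\n' "$rows" | awk '
  { sh += $3; bh += $5; bs += $7 }
  END { printf "%d %d %d | %.2f %.2f %.2f", sh, bh, bs, sh/NR, bh/NR, bs/NR }')

echo "$summary"
```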

-----------------topic ---------------
Relations of words to words

Words relate to one another in many ways. Those relations have differential effects on those people who perceive the words and their relationships.
Hung, Su-Pin, Huang, Po-Sheng, & Chen, Hsueh-Chih. Cognitive Complexity in the Remote Association Test - Chinese Version. Creativity Research Journal. ISSN: 1040-0419 (Print) 1532-6934 (Online). Journal homepage: http://www.tandfonline.com/loi/hcrj20

Dailey, D.P. (1978). An analysis and evaluation of the internal validity of the remote associates test: What does it measure? Educational & Psychological Measurement, 38, 1031–1040. doi:10.1177/001316447803800421

Davies, Mark. (2011) N-grams data from the Corpus of Contemporary American English (COCA). Downloaded from http://www.ngrams.info on January 11, 2017.

$ awk '$3 == "hap" {sum += $1; print $0} END {print sum}' w2_.txt
27      and     hap
31      the     hap
24      what    hap
82

$ awk '$2 == "fragrant" {sum += $1; print $0} END {print sum}' w2_.txt
96      fragrant        and
42      fragrant        flowers
42      fragrant        white
108     fragrant        with
288

get 100 good words from WordsInManyPlaces:
$ for i in `seq 11 15`;do shuf -n 20 <(grep ^[[:space:]]*$i WordsInManyPlaces); done
     11 cozenage
     11 melon
     11 vantage

$ for i in `seq 11 15`;do shuf -n 20 <(grep ^[[:space:]]*$i WordsInManyPlaces); done|awk '{print $2}'|sort>ranGoodWords
[ddailey@daileyproject-srunet-sruad-edu word]$ head ranGoodWords
abaft
airy
bombardment
booth
bout
burst
coldness
conscience
contribution
conviction

awk '$3 == "very" {sum += $1; print $0} END {print sum}' w2_.txt
78972   a       very
164     about   very
26      across  very
57      act     very
62      acted   very
81      acting  very
341     actually        very


$ awk '$3 == v {sum += $1} END {print v,sum} ' v="very" w2_.txt
very 353982

$ for i in `head ranGoodWords`
> do
> awk '$3 == v {sum += $1} END {print v,sum} ' v=$i w2_.txt
> done
abaft
airy 647
bombardment 462
booth 3664
bout 678
burst 4929
coldness 169
conscience 4199
contribution 6020
conviction 4889

$ grep abaft w2_.txt
[none]




tos and froms
Those at left are in third column (hence being linked to);
those at right are in 2nd and being linked from.


$ time for i in `cat ranGoodWords`; do awk '$3 == v {sum += $1} END {print v,sum} ' v=$i w2_.txt; done
$ time for i in `cat ranGoodWords`; do awk '$2 == v {sum += $1} END {print v,sum} ' v=$i w2_.txt; done

(outputs side by side: first command at left, second command at right)
abaft                   abaft
airy 647                airy 98
bombardment 462         bombardment 377
booth 3664              booth 1810
bout 678                bout 1574
burst 4929              burst 7172
coldness 169            coldness 135
conscience 4199         conscience 1720
contribution 6020       contribution 7175
conviction 4889         conviction 3904
corduroy 159            corduroy 158
decay 1518              decay 966
delineate 269           delineate 206
demerit                 demerit
...
vociferous 161          vociferous 53
voluptuousness          voluptuousness
vouch 194               vouch 316
waft 51                 waft 46
watering 646            watering 882
whereby 430             whereby 1058
whip 2708               whip 1824
yank 409                yank 407
real    0m35.144s       real    0m30.087s
user    0m34.433s       user    0m29.400s
sys     0m0.655s        sys     0m0.645s
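
Each loop above re-reads w2_.txt once per word. A single pass that loads the word list first and accumulates both totals might be considerably quicker; this is a sketch on made-up stand-in files (a two-word list and five bigram rows), not a timed replacement.

```shell
work=$(mktemp -d)
# Made-up stand-ins: a short word list and a few "count first second" bigram rows.
printf '%s\n' 'airy' 'abaft' > "$work/ranGoodWords"
printf '%s\t%s\t%s\n' \
  202 an airy \
  228 and airy \
  67 airy and \
  31 airy disk \
  49 viola and > "$work/w2_.txt"

# Read the word list first, then make one pass over the bigrams, adding each
# count to the "to" total (word in column 3) or the "from" total (word in column 2).
totals=$(awk '
  NR == FNR { want[$1]; order[++n] = $1; next }
  $3 in want { tosum[$3] += $1 }
  $2 in want { fromsum[$2] += $1 }
  END { for (i = 1; i <= n; i++) print order[i], tosum[order[i]] + 0, fromsum[order[i]] + 0 }
' "$work/ranGoodWords" "$work/w2_.txt")

printf '%s\n' "$totals"
rm -r "$work"
```

Unlike the loops above, this prints 0 rather than a blank for a word (such as abaft) that never occurs.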


$ awk '$3 == "airy" {sum += $1; print $0} END {print sum}' w2_.txt
202     an      airy
228     and     airy
23      of      airy
167     the     airy
27      with    airy
647
[ddailey@daileyproject-srunet-sruad-edu word]$ awk '$2 == "airy" {sum += $1; print $0} END {print sum}' w2_.txt
67      airy    and
31      airy    disk
98

[airy disk?]
An Airy disk is the central bright circular region of the pattern produced by light diffracted when passing through a small circular aperture.

$ awk '$2 == "viola" {sum += $1; print $0} END {print sum}' w2_.txt
49      viola   and
32      viola   frey
23      viola   said
104

Looking at all the violas for one viola

[ddailey@daileyproject-srunet-sruad-edu word]$ awk '$3 == "viola" {sum += $1; print $0} END {print sum}' w2_.txt
72      and     viola
33      frank   viola
65      the     viola
33      to      viola
203


$ cat w2_.txt |awk '{print $2;print $3}'|sort|uniq -c>W2Voc
[ddailey@daileyproject-srunet-sruad-edu word]$ wc W2Voc
  68784  137568 1153655 W2Voc

Looking at all the violas for all violas
$ head -17000 w2_.txt |awk '(NR>1) && ($2!=p){print p, s; s=0} {p=$2; s+=$1} END{print p, s}'
a 8586383

$ head -17500 w2_.txt |awk '(NR>1) && ($2!=p){print p, s; s=0} {p=$2; s+=$1} END{print p, s}'
a 8938942
]$ head -17700 w2_.txt |awk '(NR>1) && ($2!=p){print p, s; s=0} {p=$2; s+=$1} END{print p, s}'
a 9054108
a&amp;m 55
a-j 24
a-line 25
a-national 36
a-struck 28
a-wim 38
a. 1626
a.2d 37
a.c 154
a.d 57
a.g 95
a.j 495
a.k.a. 24
a.m 52

85      a       zoom
23      a       zucchini
77      a       zulu
9054108
[ddailey@daileyproject-srunet-sruad-edu word]$ awk '$2 == "a" {sum += $1; print $0} END {print sum}' w2_.txt
$ awk '$2 == "a" {sum += $1; print $0} END {print sum}' w2_.txt|wc
  17654   52960  254120


More than 17,000 distinct 2-grams begin with 'a'

Time comparison of cat file|awk vs awk file
]$ time awk '(NR>1) && ($2!=p){print p, s; s=0} {p=$2; s+=$1} END{print p, s}' w2_.txt|wc
  47803   95606  603472

real    0m1.267s
user    0m1.239s
sys     0m0.026s
[ddailey@daileyproject-srunet-sruad-edu word]$ time cat w2_.txt |awk '(NR>1) && ($2!=p){print p, s; s=0} {p=$2; s+=$1} END{print p, s}'|wc
  47803   95606  603472

real    0m1.283s
user    0m1.254s
sys     0m0.026s

$ time cat w2_.txt |awk '(NR>1) && ($2!=p){print s, p; s=0} {p=$2; s+=$1} END{print s,p}'>w2FromFreq

real    0m1.287s
user    0m1.262s
sys     0m0.024s

$ shuf -n 10 w2FromFreq
122 savannah
1102 kosovo
27 georg
22204 grow
44 unchecked
5773 versions
78 bossa
51 barro
35 steerer
25 summarised
[ddailey@daileyproject-srunet-sruad-edu word]$ wc w2FromFreq
 47803  95606 603472 w2FromFreq
$ cat w2_.txt|sort -k3 -k2>w2ToSort (sorting first on field 3 then 2)

$ more w2ToSort
275     a       a
119     abandon a
116     abandoned       a
74      abandoning      a
28      abetting        a
39      abhors  a
786     aboard  a
42      abort   a
60      abortion        a


$ time cat w2ToSort |awk '(NR>1) && ($3!=p){print s, p; s=0} {p=$3; s+=$1} END{print s,p}'>w2ToFreq

real    0m1.166s
user    0m1.145s
sys     0m0.021s
[ddailey@daileyproject-srunet-sruad-edu word]$ wc w2ToSort w2ToFreq
 1020385  3061155 16660623 w2ToSort
   62023   124046   777318 w2ToFreq
$ head w2ToFreq
8156487 a
2200 a.
64 a*
57 a+
109 a1
35 a-10
31 a-12
59 a2
72 a380
64 a4
[ddailey@daileyproject-srunet-sruad-edu word]$ head w2FromFreq
9054108 a
55 a&amp;m
24 a-j
25 a-line
36 a-national
28 a-struck
38 a-wim
1626 a.
37 a.2d
154 a.c

$ more W2Voc
  23833 a
     63 a.
      1 a*
      1 a+
      3 a1
      1 a-10
      1 a-12
      1 a2
      1 a.2d
      1 a380



$ awk '{print $2}' w2FromFreq>w2FromVoc
[ddailey@daileyproject-srunet-sruad-edu word]$ awk '{print $2}' w2ToFreq>w2ToVoc
[ddailey@daileyproject-srunet-sruad-edu word]$ wc w2FromVoc w2ToVoc
 47803  47803 418854 w2FromVoc
 62023  62023 541204 w2ToVoc
109826 109826 960058 total
$ wc W2Words
 68784  68784 603383 W2Words
common to the "to" and "from" vocabularies:

$ comm -12 w2FromVoc w2ToVoc |wc
  41002   41002  356347
$ comm -13 w2FromVoc w2ToVoc |wc
  21021   21021  184857
[ddailey@daileyproject-srunet-sruad-edu word]$ comm -23 w2FromVoc w2ToVoc |wc
   6801    6801   62507

$ shuf -n 10 <(comm -13 w2FromVoc w2ToVoc)
houstonians
fischman
pols
spratly
brazier
wanderer
catheterization
mccullers
randomisation
moderns

$ grep brazier w2_.txt
28      a       brazier
29      the     brazier
$ grep wanderer w2_.txt
74      a       wanderer
105     the     wanderer
$ grep mccullers w2_.txt
64      carson  mccullers





[ddailey@daileyproject-srunet-sruad-edu word]$ shuf -n 10 <(comm -23 w2FromVoc w2ToVoc)
petticoats
editorial-page
jojoba
frankincense
perfectibility
focussed
nonsettling
forfeiting
enthused
parceling

$ grep jojoba w2_.txt
23      jojoba  oil
$ grep frankincense w2_.txt
46      frankincense    and


$ head w2FromFreq w2ToFreq W2Voc
==> w2FromFreq <==
9054108 a
55 a&amp;m
24 a-j
25 a-line
36 a-national
28 a-struck
38 a-wim
1626 a.
37 a.2d
154 a.c

==> w2ToFreq <==
8156487 a
2200 a.
64 a*
57 a+
109 a1
35 a-10
31 a-12
59 a2
72 a380
64 a4

==> W2Voc <==
  23833 a
     63 a.
      1 a*
      1 a+
      3 a1
      1 a-10
      1 a-12
      1 a2
      1 a.2d
      1 a380

$ head w2FromFreq>temp1
$ sort temp1>temp
$ cp temp temp1

$ head w2ToFreq>temp2
$ sort temp2>temp
$ cp temp temp2


$ head W2Voc>temp3

$ join -j2 temp1 temp3
a 9054108 23833
a. 1626 63
a.2d 37 1
[ddailey@daileyproject-srunet-sruad-edu word]$ join -j2 temp2 temp3
a 8156487 23833
a. 2200 63
a* 64 1
a+ 57 1
a1 109 3
a-10 35 1
a-12 31 1
a2 59 1
a380 72 1
[ddailey@daileyproject-srunet-sruad-edu word]$ join -j2 temp2 temp3 temp1
join: invalid field number: ‘temp2’

[ddailey@daileyproject-srunet-sruad-edu word]$ join -j2 temp1 temp2
a 9054108 8156487
a. 1626 2200

$ sort -k2 w2FromFreq >temp
[ddailey@daileyproject-srunet-sruad-edu word]$ join -j2 temp w2ToFreq>W2FromTo
[ddailey@daileyproject-srunet-sruad-edu word]$ cp temp w2FromFreq
[ddailey@daileyproject-srunet-sruad-edu word]$ join -j2 w2FromFreq w2ToFreq>W2FromTo
[ddailey@daileyproject-srunet-sruad-edu word]$ wc W2FromTo
 41037 123113 690320 W2FromTo
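
join, like comm, wants both files sorted on the join field in the same collation. The earlier comm -12 reported 41002 words in common while the join produced 41037 lines, and a sort-order mismatch between the two files is one possible cause, though I have not tracked it down. A sketch with made-up three-line files, pinning the collation with LC_ALL=C:

```shell
work=$(mktemp -d)
# Made-up "count word" files standing in for w2FromFreq and w2ToFreq.
printf '%s\n' '9054108 a' '55 a&m' '1626 a.' > "$work/from"
printf '%s\n' '8156487 a' '2200 a.' '64 a*' > "$work/to"

# Sort both inputs on the word field, in the same collation.
LC_ALL=C sort -k2,2 "$work/from" > "$work/from.sorted"
LC_ALL=C sort -k2,2 "$work/to" > "$work/to.sorted"

# Join on field 2 of each file; output is: word, from-count, to-count.
joined=$(LC_ALL=C join -1 2 -2 2 "$work/from.sorted" "$work/to.sorted")

printf '%s\n' "$joined"
rm -r "$work"
```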

$ awk '{$4=($2/($3+1))} {print $0}' W2FromTo|sort -k4|head -20
corp 25 4958 0.00504134
chol 25 1868 0.0133761
obama 340 23923 0.0142117
thats 69 4823 0.0143035
gosh 24 1643 0.0145985
brinkley 163 10509 0.015509
jarriel 29 1631 0.0177696
stossel 60 2904 0.020654
gigantic 38 1762 0.0215542
dolan 31 1254 0.0247012
aforementioned 27 1066 0.0253046
oversized 28 1051 0.026616
normative 32 1182 0.0270499
manhattan 270 9479 0.028481

$ awk '{$4=($2/($3+1))} {print $0}' W2FromTo|sort -k4|tail -20
spokesmen 253 27 9.03571
marveling 317 34 9.05714
pieced 327 35 9.08333
kg 258 27 9.21429
leafing 250 26 9.25926
fearing 1492 160 9.26708
co-sponsored 232 24 9.28
depending 12376 1327 9.31928
spooked 224 23 9.33333
sp 1234 131 9.34848
liane 2571 26 95.2222
truckloads 229 23 9.54167
racking 258 26 9.55556
p.m 10001 1044 9.57033
salman 230 23 9.58333
entwined 278 28 9.58621
amidst 1023 105 9.65094
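
A caution on the two listings above: sort -k4 compares the ratio as text, not as a number, which is why liane (95.2222) sits among the 9.x values and why amongst (10.0237) is missing from the tail. A numeric key would be needed to get the true extremes. A sketch on four rows taken from these listings; LC_ALL=C is my addition to make the text comparison predictable, so the lexical order shown differs slightly from the recorded one.

```shell
# Rows copied from the listings: word, from-count, to-count, ratio.
rows=$(printf '%s\n' \
  'amidst 1023 105 9.65094' \
  'liane 2571 26 95.2222' \
  'amongst 1694 168 10.0237' \
  'corp 25 4958 0.00504134')

# Plain "sort -k4" compares the text from field 4 onward character by character.
lexical=$(printf '%s\n' "$rows" | LC_ALL=C sort -k4 | awk '{print $1}' | paste -s -d ' ' -)

# "-k4,4n" limits the key to field 4 and compares it as a number.
numeric=$(printf '%s\n' "$rows" | LC_ALL=C sort -k4,4n | awk '{print $1}' | paste -s -d ' ' -)

echo "lexical: $lexical"
echo "numeric: $numeric"
```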

$ awk '{$4=($2/($3+1))} {print $0}' W2FromTo|sort -k4|grep amidst
amidst 1023 105 9.65094
$ awk '{$4=($2/($3+1))} {print $0}' W2FromTo|sort -k4|grep "^amid "
amid 5070 926 5.46926

$ awk '{$4=($2/($3+1))} {print $0}' W2FromTo|sort -k4|grep ^amongst
amongst 1694 168 10.0237

$ awk '{$4=($2/($3+1))} {print $0}' W2FromTo|sort -k4|grep "^among "
among 124865 78473 1.59116
$ head -40 W2Voc
  23833 a
     63 a.
      1 a*
      1 a+
      3 a1
      1 a-10
...
     20 aa
     12 aaa
      ...
      4 a&amp;m
     ...
      1 aardwolf
     56 aaron
      6 aarp
      1 aas
      1 aasa
      1 aasl
      1 aasp
      3 aau
     ...

$ head -40 W2FromTo
a 9054108 8156487
a. 1626 2200
a1 27 109
aa 534 970
aaa 114 810
a&amp;m 55 320
aaron 2534 2006
aarp 63 379
aau 29 61
ab 119 341
aback 444 969
abandon 3540 4212
...

$ join -2 1 -1 2 <(head -40 W2Voc) <(head -40 W2FromTo)
a 23833 9054108 8156487
a. 63 1626 2200
a1 3 27 109
aa 20 534 970
aaa 12 114 810
a&amp;m 4 55 320
aaron 56 2534 2006
aarp 6 63 379
aau 3 29 61

Calculating standard deviation of  seq 1 20:
$ awk '{for(i=1;i<=NF;i++) {sum[i] += $i; sumsq[i] += ($i)^2}}
>           END {for (i=1;i<=NF;i++) {
>           printf "%f %f \n", sum[i]/NR, sqrt((sumsq[i]-sum[i]^2/NR)/NR)}
>          }' <(seq 1 20)
10.500000 5.766281


$ awk '{for(i=1;i<=NF;i++) {sum+= $i; sumsq[i] += ($i)^2}}
          END {for (i=1;i<=NF;i++) {
          printf "%f %f \n", sum/NR, sqrt((sumsq[i]-sum^2/NR)/NR)}
         }' <(seq 1 20)
10.500000 5.766281

$ awk '{for(i=1;i<=NF;i++) {sum+= $i; sumsq += ($i)^2}}
          END {for (i=1;i<=NF;i++) {
          printf "%f %f \n", sum/NR, sqrt((sumsq-sum^2/NR)/NR)}
         }' <(seq 1 20)
10.500000 5.766281
$ awk '{for(i=1;i<=NF;i++) {sum+= $i; sumsq += ($i)^2}}
          END  {
          printf "%f %f \n", sum/NR, sqrt((sumsq-sum^2/NR)/NR)}
         ' <(seq 1 20)
10.500000 5.766281

-----------------topic ---------------