Rhythms of the language
Alphabets, syllabaries, ideographies: the choice of a writing system may be influenced by a language’s cadence.
The choice of how a language invents a Pig Latin may be as well. [1] [2]
A language like Hawaiian (like Cherokee and Inupiaq) would be well suited to the development of a syllabary rather than an alphabet.
Why? Agglutination (rather than inflection or isolation), few consonants, and even fewer consonant clusters.
Consider the following:
forty 5
ghost 5
gipsy 5
glory 5
mopsy 5
almost 6
begirt 6
biopsy 6
chintz 6
dehort 6
What do they have in common?
Or... how about these: inkier and purply? (Hint: .5.-3.-2.-4.13)
Outline:
- About letters in alphabetical or inverse alphabetical order
- Letters in alphabetical or inverse alphabetical order:
  - Examples
  - Formalisms and probabilities
- Rhythms of alphabetic ordering for artificial and actual words
- Words of a given rhythm
- Letter gap differentials
- A different kind of rhyme: Words with rhyming gap differentials
- More "standard" (acoustically obvious) rhythms
- Dividing characters into Consonants and Vowels
- Consonant Vowel rhythms in English, Spanish, French and German vocabulary
- pretty picture space
On probabilities of monotonic (and other) letter sequences:
Motivation: there are more words whose letters are in alphabetical order than words whose letters are in inverse alphabetical order:
#(alpha order)
$ cat $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i < $(i+1)) c++ }; if (c>NF-2) print $0,NF }' FS=""|sort -nk2|wc
212 424 1362
#(inverse alpha order)
$ cat $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i > $(i+1)) c++ }; if (c>NF-1) print $0,NF }' FS=""|sort -nk2|wc
145 290 914
Examples:
$ cat $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i < $(i+1)) c++ }; if (c>NF-2) print $0,NF }' FS=""|sort -nk2|tail
forty 5
ghost 5
gipsy 5
glory 5
mopsy 5
almost 6
begirt 6
biopsy 6
chintz 6
dehort 6
$ cat $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i > $(i+1)) c++ }; if (c>NF-1) print $0,NF }' FS=""|sort -nk2|tail
polka 5
solid 5
sonic 5
spoke 5
theca 5
tonic 5
unfed 5
wrong 5
sponge 6
vomica 6
This observation led to an investigation of “lexical letter rhythms,” as well as curiosity about
a) whether the above points to some “preference” for monotonically increasing sequences, or simply to the possibility that more English words begin with letters early in the alphabet, hence making increasing sequences more probable;
b) whether the rhythms of monotonicity in letter sequences favor certain patterns more than others;
c) the extent to which all of this can be explained by pure randomness.
Let α ∈ {a..z}* with |α| = 2 and α = a1a2.
(In English, this just means: let the symbol alpha refer to a string of two lowercase letters, a1 and a2, from the English alphabet.)
Let us write a1 < a2 to mean that a1 is alphabetically prior to a2.
If α is chosen uniformly at random from the 26^2 two-letter strings, then P(a1 = a2) = 1/26 and P(a1 < a2) = ½ (25/26) ≈ .48.
In actuality, of the 43 two letter words in $w:
$ egrep ^[a-z]{2}$ $w
ah am an as at
ax ay be bo by
do em en ex fa
go ha he id if
in is it la lo
me mi my no of
oh on or os ox
pi re so to up
us we ye
$ egrep ^[a-z]{2}$ $w|wc
43 43 129
24 of them have a1 < a2, while the other 19 have a1 > a2. A split like this is well within what chance alone would produce.
For longer words, though, the situation is more complex. Let’s consider three letter sequences, both English words and nonwords.
For an arbitrary letter sequence α ∈ {a..z}* with |α| = n and α = a1a2 … an, we call the sequence monotonic increasing if ai < aj whenever 1 ≤ i < j ≤ n. It is monotonic nondecreasing if ai ≤ aj whenever 1 ≤ i < j ≤ n. Monotonic decreasing and nonincreasing are defined the same way, with > and ≥.
Examples:
- α =abc is monotonic increasing, but is not a word.
- his is a monotonic increasing word.
- accent is a nondecreasing word
- zone is a decreasing word.
- yucca is a nonincreasing word.
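The definitions above can be sketched as a small shell function. This is only an illustration under assumptions: the name `classify` is made up here, the input is taken to be a single lowercase word, and `substr` is used instead of the empty field separator so the sketch does not depend on a particular awk.

```shell
# classify WORD: print increasing, nondecreasing, decreasing,
# nonincreasing, or mixed (hypothetical helper, lowercase input assumed)
classify() {
  printf '%s\n' "$1" | awk '{
    up = 0; down = 0; same = 0
    # compare each letter with its successor
    for (i = 1; i < length($0); i++) {
      a = substr($0, i, 1); b = substr($0, i + 1, 1)
      if (a < b) up++
      else if (a > b) down++
      else same++
    }
    if (down == 0 && same == 0) print "increasing"
    else if (down == 0) print "nondecreasing"
    else if (up == 0 && same == 0) print "decreasing"
    else if (up == 0) print "nonincreasing"
    else print "mixed"
  }'
}

classify his      # increasing
classify accent   # nondecreasing
classify zone     # decreasing
classify yucca    # nonincreasing
```

Checking adjacent pairs is enough, because the ordering of letters is transitive. A word that is strictly increasing is reported as "increasing" only, although it also satisfies the nondecreasing definition.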
Rhythms of alphabetic ordering for artificial and actual words
-----------------------
Four letter words -- what are the most common rhythms?
sampling real words:
$ egrep ^.{4}$ $w|awk '{for (i=1;i<NF;i++) {if ($i>$(i+1))s=s""0;else if ($i<$(i+1))s=s""2;else s=s""1};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|sort -k2|uniq -cf1|sort -n
sampling artificial sequences:
$ shuf -ern 8000 {a..z}|xargs -L 4|sed 's/\ //g'|awk '{for (i=1;i<NF;i++) {if ($i>$(i+1))s=s""0;else if ($i<$(i+1))s=s""2;else s=s""1};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|sort -k2|uniq -cf1|sort -n|tail -18
(artificial: count example rhythm)    (real: count example rhythm)
 10 blls 212      1 eery 122
 11 bbcz 122      1 ooze 120
 11 hhfd 100      8 miff 001
 16 agqq 221     19 bell 221
 18 dccx 012     21 ally 212
 22 aame 120     23 feed 010
 24 ddbx 102     27 abba 210
 25 ajjh 210     38 eddy 012
 25 cabb 021     39 biff 201
 28 amhh 201     47 ball 021
 64 abcy 222     50 life 000
 72 hfea 000     63 abet 222
197 bafn 022    174 able 220
205 ecbd 002    190 aged 200
206 abqj 220    202 fear 002
222 amja 200    248 babe 022
408 bazq 020    365 afar 202
417 aeaf 202    475 bake 020
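The above data represent rhythm frequencies, example words, and actual rhythm patterns, for the artificial sequences first and the real words second. Each digit of a pattern encodes one adjacent pair of letters: 0 for a step down the alphabet, 1 for a repeated letter, 2 for a step up.
Down up down (020), as in the word "bake", is the most frequent (475 out of the 1991 four letter words in this resource (FRELI)).
The second most common rhythm pattern is 202, as in "afar", present in 365 of the words.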
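The rhythm encoding used in these one-liners can be sketched as a standalone function. This is a minimal illustration, not the commands actually run above: the name `rhythm` is invented here, and `substr` replaces the `FS=""` trick so that it runs on awks that do not split on an empty separator.

```shell
# rhythm WORD: print one digit per adjacent letter pair
#   0 = step down the alphabet, 1 = repeated letter, 2 = step up
# (hypothetical helper; lowercase input assumed)
rhythm() {
  printf '%s\n' "$1" | awk '{
    s = ""
    for (i = 1; i < length($0); i++) {
      a = substr($0, i, 1); b = substr($0, i + 1, 1)
      s = s (a > b ? "0" : (a < b ? "2" : "1"))
    }
    print s
  }'
}

rhythm bake     # 020
rhythm afar     # 202
rhythm ballad   # 02102
```

A word of n letters always yields a pattern of n-1 digits, which is why the four letter tables show three-digit rhythms.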
Six letter words (imaginary and real), and example rhythms
$ paste <(shuf -ern 25000 {a..z}|xargs -L 6|sed 's/\ //g'|awk '{for (i=1;i<NF;i++) {if ($i>$(i+1))s=s""0;else if ($i<$(i+1))s=s""2;else s=s""1};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|sort -k2|uniq -cf1|sort -n|tail -20) <(egrep ^.{6}$ $w|awk '{for (i=1;i<NF;i++) {if ($i>$(i+1))s=s""0;else if ($i<$(i+1))s=s""2;else s=s""1};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|sort -k2|uniq -cf1|sort -n|tail -20)
(imaginary: count example rhythm)    (real: count example rhythm)
 64 gazfec 02000     65 adagio 20222
 70 acadpw 20222     66 health 00220
 78 aglbgo 22022     67 backup 02220
 78 awgfck 20002     69 abduce 22202
 90 ihcxut 00200    101 amical 20002
101 dabnxd 02220    108 abrade 22022
115 atrauv 20022    108 cajole 02200
122 bamuih 02200    109 ballad 02102
124 abriet 22002    134 abased 20200
153 ebaltp 00220    146 abacus 20220
157 bahehv 02022    148 abject 22002
169 abnfwk 22020    148 alight 20022
171 akauob 20200    176 ablate 22020
189 asnlol 20020    181 afeard 20020
193 baqrbe 02202    183 featly 00202
199 gdayfp 00202    237 backer 02202
213 acadvq 20220    254 bakery 02022
231 caztov 02002    270 banger 02002
293 cawlnc 02020    317 agency 20202
304 abapcl 20202    346 balize 02020
Words of a given rhythm
02102 (ballad)
Compare its frequency (109 words out of 4321 six letter words) with the following, based on a similar count ($ echo "4321 * 6"|bc = 25926 letters) of six letter random words:
$ shuf -ern 25926 {a..z}|xargs -L 6|sed 's/\ //g'|awk '{for (i=1;i<NF;i++) {if ($i>$(i+1))s=s""0;else if ($i<$(i+1))s=s""2;else s=s""1};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|sort -k2|uniq -cf1|sort -n|grep 02102
11 ihvviw 02102
Such a rhythm is much more likely in English (109 words) than in random letter sequences (11 strings): it needs a doubled letter in the middle, and doubled letters are far more common in English than in random strings.
Here's how to find all the words of that given rhythm:
$ egrep ^.{6}$ $w|awk '{for (i=1;i<NF;i++) {if ($i>$(i+1))s=s""0;else if ($i<$(i+1))s=s""2;else s=s""1};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|grep 02102|xargs -L6
ballad 02102 ballet 02102 banner 02102 barrel 02102 barren 02102 barrow 02102
basset 02102 batten 02102 batter 02102 caller 02102 capper 02102 carrot 02102
dagger 02102 dapper 02102 fallen 02102 farrow 02102 fatten 02102 fellah 02102
fennel 02102 ferret 02102 fetter 02102 gaffer 02102 galley 02102 gammer 02102
garret 02102 hammer 02102 happen 02102 harrow 02102 hatter 02102 jennet 02102
kennel 02102 killer 02102 kipper 02102 kisser 02102 kitten 02102 lammas 02102
lappet 02102 latter 02102 lerret 02102 lessen 02102 lesser 02102 lessor 02102
letter 02102 litter 02102 mallet 02102 mammal 02102 manner 02102 marrow 02102
matter 02102 miller 02102 millet 02102 mirror 02102 mitten 02102 mizzen 02102
narrow 02102 natter 02102 nipper 02102 pallet 02102 parrot 02102 passim 02102
patten 02102 patter 02102 pellet 02102 pepper 02102 pillar 02102 potter 02102
powwow 02102 rammer 02102 rappel 02102 rattan 02102 reggae 02102 rillet 02102
rotten 02102 rotter 02102 sapper 02102 seller 02102 setter 02102 simmer 02102
sinner 02102 sippet 02102 sirrah 02102 sitter 02102 sorrel 02102 sorrow 02102
tanner 02102 tassel 02102 tatter 02102 teller 02102 tenner 02102 tennis 02102
terret 02102 terror 02102 tetter 02102 tiller 02102 tippet 02102 titter 02102
topper 02102 totter 02102 valley 02102 vassal 02102 vennel 02102 vessel 02102
wallet 02102 warren 02102 winner 02102 yammer 02102 yarrow 02102 zaffer 02102
zipper 02102
Eight letter rhythms: real and artificial
$ paste <(shuf -ern 41000 {a..z}|xargs -L 8|sed 's/\ //g'|awk '{for (i=1;i<NF;i++) {if ($i>$(i+1))s=s""0;else if ($i<$(i+1))s=s""2;else s=s""1};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|sort -k2|uniq -cf1|sort -n|tail -20) <(egrep ^.{8}$ $w|awk '{for (i=1;i<NF;i++) {if ($i>$(i+1))s=s""0;else if ($i<$(i+1))s=s""2;else s=s""1};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|sort -k2|uniq -cf1|sort -n|tail -20)
(artificial: count example rhythm)    (real: count example rhythm)
 59 cayeaeyh 0200220     59 apiarian 2002002
 60 ageamwpu 2002202     61 abjectly 2200202
 63 hdauctlc 0020200     63 babushka 0220020
 64 ajcbsdsy 2002022     64 alarmist 2020022
 65 cbfyfrsk 0220220     66 headland 0022020
 67 abvpdzst 2200202     71 backdrop 0220202
 69 afbudzfa 2020200     74 alacrity 2022022
 75 ihaqpvuy 0020202     75 amenable 2020220
 76 dbdcprnw 0202202     87 barbican 0202002
 78 dcogogep 0202002     90 balister 0202202
 81 acobrqsf 2202020     91 alkahest 2002022
 82 ajfocfrp 2020220     93 acarpous 2020020
 86 agcfxhvb 2022020     93 bargeman 0200202
 87 baqnclkx 0200202    102 actively 2202022
 91 canjpghr 0202022    123 acanthus 2022020
 94 ajedfewl 2002020    132 ablation 2202020
 94 aoevpozg 2020020    135 bakeshop 0202022
100 baetkvdp 0220202    139 alfresco 2020202
154 dcectqxe 0202020    145 alienage 2002020
166 ajgteico 2020202    227 balanced 0202020
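Letter gap differentials
Instead of recording only the direction of each step, record its signed size: the alphabet position of each letter minus that of the letter before it.
$ egrep ^.{4}$ $w|awk 'BEGIN { C = "" ; for ( i = 0 ; ++i < 256 ; ) C = C sprintf ( "%c" , i ) };{for (i=1;i<NF;i++) {s=s"."(index(C,$(i+1))-index(C,$i))};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|sort -k2|uniq -cf1|sort -n|tail -20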
2 hide .1.-5.1
2 john .5.-7.6
2 lean .-7.-4.13
2 link .-3.5.-3
2 lion .-3.6.-1
2 loaf .3.-14.5
2 loch .3.-12.5
2 meed .-8.0.-1
2 milt .-4.3.8
2 mold .2.-3.-8
2 molt .2.-3.8
2 opal .1.-15.11
2 open .1.-11.9
2 pail .-15.8.3
2 pelt .-11.7.8
2 proa .2.-3.-14
2 punk .5.-7.-3
2 spec .-3.-11.-2
3 abba .1.0.-1
3 lang .-11.13.-7
$ egrep ^.{4}$ $w|awk 'BEGIN { C = "" ; for ( i = 0 ; ++i < 256 ; ) C = C sprintf ( "%c" , i ) };{for (i=1;i<NF;i++) {s=s"."(index(C,$(i+1))-index(C,$i))};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|grep 1.0.-1
abba deed noon .1.0.-1
lang perk shun .-11.13.-7
$ egrep ^.{3}$ $w|awk 'BEGIN { C = "" ; for ( i = 0 ; ++i < 256 ; ) C = C sprintf ( "%c" , i ) };{for (i=1;i<NF;i++) {s=s"."(index(C,$(i+1))-index(C,$i))};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|grep "\.3\.0"
add bee ill loo .3.0
$ egrep ^.{5}$ $w|awk 'BEGIN { C = "" ; for ( i = 0 ; ++i < 256 ; ) C = C sprintf ( "%c" , i ) };{for (i=1;i<NF;i++) {s=s"."(index(C,$(i+1))-index(C,$i))};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|sort -k2|uniq -cf1|sort -n|tail -5
1 zocle .-11.-12.9.-7
2 chain .5.-7.8.5
2 cheer .5.-3.0.13
2 opera .1.-11.13.-17
2 pecan .-11.-2.-2.13
A different kind of rhyme: Words with rhyming gap differentials
$ egrep ^.{5}$ $w|awk 'BEGIN { C = "" ; for ( i = 0 ; ++i < 256 ; ) C = C sprintf ( "%c" , i ) };{for (i=1;i<NF;i++) {s=s"."(index(C,$(i+1))-index(C,$i))};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|grep ".-11.-2.-2.13"
pecan tiger .-11.-2.-2.13
Etc. for opera, cheer, chain:
opera stive .1.-11.13.-17
cheer jolly .5.-3.0.13
chain ingot .5.-7.8.5
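The gap differential itself can be sketched as a small function, which makes it easy to check whether two words "rhyme" in this sense. The sketch rests on assumptions: the name `gaps` is invented here, and it looks letters up in a lowercase-only alphabet string rather than the 256-character table built in the commands above, so anything that is not a lowercase letter counts as position 0.

```shell
# gaps WORD: print the signed alphabet distance between each adjacent
# pair of letters, in the same ".5.-3" notation used in the listings
# (hypothetical helper; lowercase a-z assumed)
gaps() {
  printf '%s\n' "$1" | awk '{
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    s = ""
    for (i = 1; i < length($0); i++)
      s = s "." (index(alphabet, substr($0, i + 1, 1)) - index(alphabet, substr($0, i, 1)))
    print s
  }'
}

gaps pecan    # .-11.-2.-2.13
gaps tiger    # .-11.-2.-2.13

# two words rhyme in this sense when their gap strings are equal
[ "$(gaps inkier)" = "$(gaps purply)" ] && echo "inkier and purply rhyme"
```

Two words with the same gap string are shifts of one another along the alphabet: adding 4 to each letter of pecan gives tiger.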
Bigger dictionary ($T)
$ egrep ^.{7}$ $T|awk 'BEGIN { C = "" ; for ( i = 0 ; ++i < 256 ; ) C = C sprintf ( "%c" , i ) };{for (i=1;i<NF;i++) {s=s"."(index(C,$(i+1))-index(C,$i))};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|sort -k2|uniq -cf1|sort -n|tail -5
1 zymotic .-1.-12.2.5.-11.-6
1 zymurgy .-1.-12.8.-3.-11.18
1 zyzzyva .-1.1.0.-1.-3.-21
2 fortran .9.3.2.-2.-17.13 (FORTRAN)
2 primero sulphur .2.-9.4.-8.13.-3
steeds tuffet .1.-15.0.-1.15
paopao testes .-15.14.1.-15.14
inkier purply .5.-3.-2.-4.13
alohas grungy .11.3.-7.-7.18
anteed bouffe .13.6.-15.0.-1
pinot .-7.5.1.5 unsty .-7.5.1.5
mocha .2.-12.5.-7 suing .2.-12.5.-7
labor .-11.1.13.3 shivy .-11.1.13.3
ebola .-3.13.-3.-11 herod .-3.13.-3.-11
cobra .12.-13.16.-17 freud .12.-13.16.-17
banjo .-1.13.-4.5 ferns .-1.13.-4.5
$ egrep ^[a-z]{3}$ $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i < $(i+1)) c++ }; if (c>NF-2) print $0,NF }' FS=""|sort -nk2|wc
86 172 516
$ egrep ^[a-z]{3}$ $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i < $(i+1)) c++ }; if (c>NF-2) print $0,NF }' FS=""|head -30|xargs -L10
ace 3 act 3 ado 3 aft 3 ago 3 ail 3 aim 3 air 3 alp 3 amp 3
ant 3 any 3 apt 3 art 3 beg 3 bel 3 ben 3 bet 3 bey 3 bin 3
bis 3 bit 3 biz 3 bow 3 box 3 boy 3 buy 3 cop 3 cot 3 cow 3
Nondecreasing:
threes:
$ egrep ^[a-z]{3}$ $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i <= $(i+1)) c++ }; if (c>NF-2) print $0,NF }' FS=""|wc
102 204 612
(includes, for example, eel, inn and moo that are not strictly monotonic)
$ echo {a..z}| sed 's/[ ]/*/g;s/z/z*/'
a*b*c*d*e*f*g*h*i*j*k*l*m*n*o*p*q*r*s*t*u*v*w*x*y*z*
$ grep ^`echo {a..z}| sed 's/[ ]/*/g;s/z/z*/'`$ $w|wc
310 310 1496
$ grep ^`echo {a..z}| sed 's/[ ]/*/g;s/z/z*/'`$ $w|xargs -L10|head -5
a abbess abbey abbot abet abhor ably abort accent accept
access accost ace act add adder adept adit ado adopt
aegis affix afflux afoot aft agio aglow ago ah ail
aim air airy all alloquy allot allow alloy ally almost
alms alp am amp amps an annoy ant any apt
Nonincreasing:
$ grep ^`echo {z..a}| sed 's/[ ]/*/g;s/a/a*/'`$ $w|wc
196 196 900
Only 196 of these, as opposed to 310 nondecreasing.
$ grep ^`echo {z..a}| sed 's/[ ]/*/g;s/a/a*/'`$ $w|xargs -L10|tail -5
unfed up upon urge urn us use used via vie
void vomica we web wed wee weed wife wig wigged
woe woke wold wolf womb won woo wood woof wool
woon wrong x ye yea yob yoga yoke yolk yon
yucca yule yuppie zone zoo zoom
So, for a random string of length three to be monotonic increasing, we
must have all three chars distinct. Of the 26^3 = 17576 strings of
length three, 26* 25* 24 of them have three distinct chars. So P(3
distinct) = 26*25*24/26^3 ≈ .888. Once three distinct chars are chosen,
each of the six orderings (abc, acb, bac, bca, cab and cba) is equally
likely, and only one is monotonic increasing. Hence the probability of
getting three chars, at random, to be monotonic increasing is about
.148 . The same would be true of the probability of having three chars
being monotonic decreasing.
Given that there are 587 three letter words in $w *, we’d
expect (26*25*24/(6*26^3))*587 or about 86.83 to be monotonic
increasing and the same number to be monotonic decreasing.
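The same argument works for any length n: the letters must be distinct, and only one of their n! orderings is increasing, so the probability is 26·25·…·(26−n+1) / (n! · 26^n). A minimal sketch of that arithmetic follows; the function name `p_increasing` is made up for illustration, and uniformly random lowercase letters are assumed.

```shell
# p_increasing N: probability that N uniformly random lowercase letters
# come out in strictly increasing order (hypothetical helper)
p_increasing() {
  awk -v n="$1" 'BEGIN {
    p = 1
    # multiply in one factor (26-k) / (26 * (k+1)) per letter
    for (k = 0; k < n; k++) p *= (26 - k) / (26 * (k + 1))
    printf "%.6f\n", p
  }'
}

p_increasing 2   # 0.480769
p_increasing 3   # 0.147929
```

Multiplying the value for n = 3 by the 587 three letter words reproduces the expectation of about 86.83 quoted above. By symmetry the same probability applies to strictly decreasing strings.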
Sure enough, there are 86 increasing words:
$ egrep ^[a-z]{3}$ $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i < $(i+1)) c++ }; if (c>NF-2) print $0 }' FS=""|wc
86 86 344
$ egrep ^[a-z]{3}$ $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i < $(i+1)) c++ }; if (c>NF-2) print $0 }' FS=""|tail -30|xargs -L15
fry gin gnu got guy him hip his hit hop hot how hoy imp ivy
jot joy lop lot low lox loy mop mow nor not now opt pry sty
But only 57 decreasing ones:
$ egrep ^[a-z]{3}$ $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i > $(i+1)) c++ }; if (c>NF-1) print $0 }' FS=""|wc
57 57 228
$ egrep ^[a-z]{3}$ $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i > $(i+1)) c++ }; if (c>NF-1) print $0 }' FS=""|tail -30|xargs -L15
sec she sib sic ski sob sod son spa tea ted the tic tie tod
toe tom ton urn use via vie web wed wig woe won yea yob yon
* $ egrep ^[a-z]{3}$ $w|wc
587 587 2348
examples:
$ egrep ^[a-z]{3}$ $w|tail -45|xargs -L15
vim vow wad wag wan war was wat wax way web wed wee wem wen
wet who why wig win wit woe won woo wop wot wry yak yam yap
yaw yea yen yes yet yew yin yip yob yon you zap zip zit zoo
Four letter words
For four letters, the probability of four random letters being all different is (26*25*24*23/(26^4)) ≈ .785.
Once all four letters are different, the likelihood of being monotonically increasing is 1/24 (given 4! permutations of the letters, with only one of those being as desired):
(26*25*24*23/(26^4))/24 ≈ .0327.
$ egrep ^[a-z]{4}$ $w|tail -45|xargs -L15
word wore work worm worn wove wrap wren writ wynd yang yank yard yare yarn
yarr yaup yawl yawn yawp yean year yell yelp yerk yeti yipe yoga yogi yoho
yoke yolk yore your yule zany zarp zeal zebu zero zest zinc zone zoom zoot
$ egrep ^[a-z]{4}$ $w|wc
1953 1953 9765
We would thus expect about 1953 * .0327 ≈ 63.89 of the four letter words to increase alphabetically.
Sure enough,
$ egrep ^[a-z]{4}$ $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i < $(i+1)) c++ }; if (c>NF-2) print $0 }' FS=""|wc
61 61 305
$ egrep ^[a-z]{4}$ $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i < $(i+1)) c++ }; if (c>NF-2) print $0 }' FS=""|xargs -L16
abet ably adit agio airy alms amps arty belt bent best bevy blot blow cent chin
chip chit chop chow city clot cloy copy cost cosy crux deft defy demo dent deny
dewy dint dirt dory doxy envy film fist flop flow flux fort foxy gilt gimp girt
gist glow gory hilt hint hist hops host knot know lost most nosy
However, again, the reversals seem not to hold up their end of the probability distribution:
$ egrep ^[a-z]{4}$ $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i > $(i+1)) c++ }; if (c>NF-1) print $0 }' FS=""|wc
48 48 240
$ egrep ^[a-z]{4}$ $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i > $(i+1)) c++ }; if (c>NF-1) print $0 }' FS=""|xargs -L16
life mica olid pica pied plea poke pole pond rife role shed skid sled slid soda
sofa soke sold sole some song spec sped spic tied toga told tomb tome tone tong
trig trod upon urge used void wife woke wold wolf womb yoga yoke yolk yule zone
Five letter words:
$ egrep ^[a-z]{5}$ $w|wc
2892 2892 17352
$ egrep ^[a-z]{5}$ $w|tail -36|xargs -L12
worth would wound woven wrack wrath wreak wreck wrest wring wrist write
wrong wrote wrung wryly xebec xenia xerox yacht yahoo yamen yearn yeast
yield yodel yokel young yours youth yucca zambo zebra zilch zippo zocle
Increasing:
$ egrep ^[a-z]{5}$ $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i < $(i+1)) c++ }; if (c>NF-2) print $0 }' FS=""|xargs -L12
abhor abort adept adopt aegis aglow befit begin begot below bijou chimp
deist deity dirty empty filmy first forty ghost gipsy glory mopsy
$ egrep ^[a-z]{5}$ $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i < $(i+1)) c++ }; if (c>NF-2) print $0 }' FS=""|wc
23 23 138
Decreasing:
$ egrep ^[a-z]{5}$ $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i > $(i+1)) c++ }; if (c>NF-1) print $0 }' FS=""|wc
8 8 48
$ egrep ^[a-z]{5}$ $w|awk '{c=0; for (i=1; i<=NF; ++i) { if ($i > $(i+1)) c++ }; if (c>NF-1) print $0 }' FS=""|xargs -L12
polka solid sonic spoke theca tonic unfed wrong
Expectation:
About 2/3 of 5 letter sequences would have all five letters different:
((26*25*24*23*22/(26^5))) ≈ .6644.
But those 5 letters must all be in the proper order (which happens with probability only 1/5! or 1/120):
((26*25*24*23*22/(26^5))/120) ≈ 0.005536
With 2892 five letter words, then, we’d expect ((26*25*24*23*22/(26^5))/120)* 2892 ≈ 16.011 for both increasing and decreasing.
Are variations as wide as 23 (increasing) and 8 (decreasing) within the realm of randomness?
Here are some random trials. The script generates 14460 chars, groups them into 2892 five letter strings, and then sorts the strings based on their internal rhythms (see more on this topic later). We restrict the output to the strictly increasing sequences (2222) or the strictly decreasing ones (0000). A few trials are run just to give an idea:
$ shuf -ern 14460 {a..z}|xargs -L 5|sed 's/\ //g'|awk '{for (i=1;i<NF;i++) {if ($i>$(i+1))s=s""0;else if ($i<$(i+1))s=s""2;else s=s""1};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|sort -k2|uniq -cf1|sort -n|egrep "([02])\1\1\1"
10 aboqy 2222
15 rkiha 0000
12 nhgcb 0000
18 adkpq 2222
12 aisuw 2222
18 tngfe 0000
11 acfst 2222
23 igfba 0000
13 aglvx 2222
13 jhfea 0000
15 nmjhf 0000
18 acimz 2222
14 mihfa 0000
16 adflr 2222
8 adhtz 2222
18 roidc 0000
2892 * 5 = 14460
Sure enough, variations as wide as observed among real words are seen
as entirely possible within the laws of chance.
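As a rough cross-check on the trials, one can also compute how far counts are expected to stray from the mean. The sketch below treats each of the 2892 strings as an independent draw with probability 0.005536 of being strictly increasing, so the count is binomial; that independence is an assumption that holds for the random trials but not necessarily for real vocabulary, and the function name `spread` is made up for illustration.

```shell
# spread N P: mean and standard deviation of a Binomial(N, P) count
# (hypothetical helper; assumes independent draws)
spread() {
  awk -v n="$1" -v p="$2" 'BEGIN {
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    printf "%.2f %.2f\n", mean, sd
  }'
}

spread 2892 0.005536   # 16.01 3.99
```

With a mean near 16 and a standard deviation near 4, the observed 23 sits about 1.75 standard deviations above the mean and the observed 8 about 2 below it, the same range that the random trials above cover.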
Six
((26*25*24*23*22*21/(26^6))) ≈ 0.5366
((26*25*24*23*22*21/(26^6)))/720 ≈ 0.00074528404
$ egrep ^[a-z]{6}$ $w|wc
4278 4278 29946
4278*((26*25*24*23*22*21/(26^6)))/720 ≈ 3.188 = expected number of monotonic sequences in each direction (up or down) for six letter strings.
$ expr 4278 "*" 6
25668
$ shuf -ern 25668 {a..z}|xargs -L 6|sed 's/\ //g'|awk '{for (i=1;i<NF;i++) {if ($i>$(i+1))s=s""0;else if ($i<$(i+1))s=s""2;else s=s""1};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|sort -k2|uniq -cf1|sort -n|egrep "([012])\1\1\1\1"
4 aejkls 22222
5 utokdc 00000
2 abdhrs 22222
3 ysonga 00000
3 abltuv 22222
3 wtoldc 00000
1 eimpqv 22222
2 vupkga 00000
0
00000
2 cefjmy 22222
4 omjfea 00000
6 aekqtx 22222
3 ahmotv 22222
3 toieca 00000
4 pmhgfd 00000
5 adfhln 22222
$ egrep ^[a-z]{6}$ $w|awk '{for (i=1;i<NF;i++) {if ($i>$(i+1))s=s""0;else if ($i<$(i+1))s=s""2;else s=s""1};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|sort -k2|uniq -cf1|sort -n|egrep "([012])\1\1\1\1"
2 sponge 00000
5 almost 22222
Seven
((26*25*24*23*22*21*20/(26^7))) ≈ 0.4128
((26*25*24*23*22*21*20/(26^7)))/5040 ≈ 0.0000819
$ egrep ^[a-z]{7}$ $w|wc
4854 4854 38832
4854 * ((26*25*24*23*22*21*20/(26^7)))/5040 ≈ 0.3975 = expected number of monotonic sequences in each direction (up or down) for seven letter strings.
$ egrep ^[a-z]{7}$ $w|awk '{for (i=1;i<NF;i++) {if ($i>$(i+1))s=s""0;else if ($i<$(i+1))s=s""2;else s=s""1};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|sort -k2|uniq -cf1|sort -n|egrep "([012])\1\1\1\1"
1 dyspnea 200000
2 obloquy 022222
2 polecat 000002
$ egrep ^[a-z]{7}$ $w|awk '{for (i=1;i<NF;i++) {if ($i>$(i+1))s=s""0;else if ($i<$(i+1))s=s""2;else s=s""1};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|egrep "([012])\1\1\1\1"
dyspnea 200000
obloquy 022222
polecat 000002
sponger 000002
thirsty 022222
None of these rhythms is all 2s or all 0s, which demonstrates that there are no strictly monotonic sequences of length 7 in $w. In fact there are none of length seven or higher.
$ wc $T $w
406712 406712 4158156 /home/ddailey/public_html/moby/mthes/TwoOrMore
35916 35916 332173 /home/ddailey/public_html/words
In the much larger dictionary ($T), there are a couple:
$ egrep ^[a-z]{7}$ $T|awk '{for (i=1;i<NF;i++) {if ($i>$(i+1))s=s""0;else if ($i<$(i+1))s=s""2;else s=s""1};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|sort -k2|uniq -cf1|sort -n|egrep "([012])\1\1\1\1"
2 deglory 222222
2 sponged 000000
16 bailors 022222
19 lifeday 000002
22 avonlea 200000
25 abortus 222220
$ egrep ^[a-z]{7}$ $T|awk '{for (i=1;i<NF;i++) {if ($i>$(i+1))s=s""0;else if ($i<$(i+1))s=s""2;else s=s""1};{print s" "$0;s="";}} ' FS=""|awk '{print $2,$1}'|egrep "([012])\1\1\1\1\1"
deglory 222222
egilops 222222
sponged 000000
wronged 000000
and
[truncated?]
More "standard" (acoustically obvious) rhythms
Counting chars in dict:
$ grep -o . $w|wc -l
296257
$ wc $w
35916 35916 332173 /home/ddailey/public_html/words
$ expr 296257 + 35916
332173
Dividing characters into Consonants and Vowels
(sort of works for European languages -- not for Chinese)
Vowels:
$ grep -o "[aeiou]" $w|wc -l
114419
$ grep -io "[aeiou]" $w|wc -l
114444
$ grep -o "[AEIOU]" $w|wc -l
25
25 + 114419 = 114444
Consonants
$ grep -o "[bcdfghjklmnpqrstvwxyz]" $w|wc -l
180896
$ grep -oi "[bcdfghjklmnpqrstvwxyz]" $w|wc -l
180944
$ grep -o "[BCDFGHJKLMNPQRSTVWXYZ]" $w|wc -l
48
$ expr 48 + 180896
180944
Together:
$ grep -io "[aeiou]" $w|wc -l
114444
$ grep -oi "[bcdfghjklmnpqrstvwxyz]" $w|wc -l
180944
$ grep -o . $w|wc -l
296257
$ expr 114444 + 180944
295388
Nonalphabetic characters:
$ grep -oi "[^a-z]" $w|wc
869 869 1738
$ grep -oi "[^a-z]" $w|sort|uniq -c
746 -
30 ;
1 .
62 '
30 &
$ expr 746 + 30 + 1 + 62 + 30
869
$ expr 869 + 295388
296257
This shows a partition of the 296257 characters of $w =
/home/ddailey/public_html/words into:
Vowels: 114444
Consonants: 180944
And other: 869
$ wc /home/ddailey/public_html/moby/mthes/SixOrMore
66023 66023 595432 /home/ddailey/public_html/moby/mthes/SixOrMore
$ echo {A..Z} {a..z}|sed s/[aeiouAEIOU\ ]//g
BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz
$ paste <(head SixOrMore) <( head SixOrMore |sed 's/[aeiouAEIOU]/A/g;s/[BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz]/C/g;s/A/V/g')
a V
a- V-
A V
aa VV
aah VVC
aahs VVCC
aardvark VVCCCVCC
aardwolf VVCCCVCC
aas VVC
ab VC
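The consonant and vowel mapping can be wrapped in a small function for trying individual words. This is a sketch under the same simplification as the commands above (a, e, i, o, u are vowels, every other unaccented letter including y is a consonant); the name `cv_pattern` is invented here. Replacing consonants first means the letters V and v are already turned into C before any V is written, so the temporary A used above is not needed.

```shell
# cv_pattern WORD: replace consonants with C and vowels with V,
# leaving any other character (hyphen, apostrophe, accents) untouched
# (hypothetical helper)
cv_pattern() {
  printf '%s\n' "$1" |
    sed 's/[bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ]/C/g; s/[aeiouAEIOU]/V/g'
}

cv_pattern aardvark   # VVCCCVCC
cv_pattern vase       # CVCV
```

Piping a word list through the same `sed` expression and then through `sort|uniq -c|sort -nr` gives the kind of frequency tables shown in this section.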
$ cat SixOrMore|sed -n '/^....$/s/[aeiouAEIOU]/A/gp'|sed 's/[BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz]/C/g;s/A/V/g'|sort|uniq -c|sort -nr
1227 CVCC
662 CVCV
468 CVVC
410 CCVC
150 VCVC
68 VCCV
59 VCCC
49 CCVV
38 VVCC
32 CCCV
18 VCVV
10 VVCV
9 CVVV
3 V'VC
2 CVC-
2 CCV-
1 VVVC
1 'CVC
$ cat SixOrMore|sed -n '/^.....$/s/[aeiouAEIOU]/A/gp'|sed 's/[BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz]/C/g;s/A/V/g'|sort|uniq -c|sort -nr
1340 CVCVC
908 CVCCC
721 CCVCC
507 CVCCV
490 CVVCC
303 CCVCV
297 CCVVC
247 VCCVC
133 VCVCC
123 CVVCV
118 VCVCV
107 CVCVV
78 CCCVC
69 VCVVC
45 VVCVC
25 VCCVV
24 CVVVC
21 VCCCV
20 VCCCC
17 CCCCV
12 VVCCC
9 CCCVV
7 VVCCV
5 CV-CV
4 CVC'C
3 VVCVV
3 CVVVV
2 CV'VC
2 CVCC-
1 VVC'C
1 VCVVV
1 VCV'C
1 VCC'C
1 CV-VC
1 CVïCV
1 CVCV-
1 CV'CV
1 CV-CC
1 C-CVC
1 'CCVC
$ cat SixOrMore|sed -n '/^......$/s/[aeiouAEIOU]/A/gp'|sed 's/[BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz]/C/g;s/A/V/g'|sort|uniq -c|sort -nr|head -50
2308 CVCCVC
905 CVCVCC
620 CCVCVC
501 CCVCCC
497 CVVCVC
492 CVCVCV
380 CVCVVC
328 VCCVCC
257 CCVVCC
239 CVCCCV
193 VCCVCV
180 VCVCVC
178 CVVCCC
157 CVCCVV
151 CVCCCC
128 CCVCCV
125 VCCVVC
111 CCCVCC
104 VCCCVC
103 CVVCCV
64 CCVVCV
59 CCCVCV
57 CCCCVC
49 VVCCVC
45 VCVCCC
41 CVVCVV
40 VCVVCC
37 VCVCCV
35 CCCVVC
31 VVCVCC
28 CCVCVV
21 VCVVCV
18 VCVCVV
17 VVCVCV
14 CVVVCC
9 VCCCVV
8 CCCCCV
7 VVCVVC
7 VVCCCC
7 VCCCCV
5 CVCVVV
5 CCCCVV
4 CCVVVC
3 CVVVCV
3 CVCC'C
2 VVCCVV
2 VVCCCV
2 VCCCCC
2 CVVVVC
2 CV-VCC
Seven letters:
$ cat SixOrMore|sed -n '/^.......$/s/[aeiouAEIOU]/A/gp'|sed 's/[BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz]/C/g;s/A/V/g'|sort|uniq -c|sort -nr|head -50
1824 CVCCVCC
928 CCVCCVC
821 CVCVCVC
694 CVCCCVC
623 CVCCVCV
546 CVCCVVC
394 CVVCVCC
361 CVVCCVC
360 CCVCVCC
333 VCCVCVC
263 CCVVCVC
231 CVCVCCV
195 CVCVCCC
167 CCVCVCV
166 CVCVVCC
147 VCVCVCC
137 CCVCVVC
136 VCCCVCC
125 VCVCVCV
111 VCVCCVC
111 VCCVCCC
110 CCVCCCV
89 CVCVVCV
88 VCCVVCC
87 CCCVCVC
85 VCCCVCV
85 CVVCVCV
82 VCCVCCV
82 CCVCCCC
78 CVCVCVV
72 CCVVCCC
65 CVVCVVC
62 VCVCVVC
60 VCCCVVC
47 VVCCVCC
44 CCCVCCC
40 CVCVVVC
36 CCCCVCC
35 CVVCCCC
34 VCVVCVC
30 VCCVVCV
28 CCVCCVV
26 VCCCCVC
26 CCVVCCV
25 VVCCCVC
25 CCCVVCC
24 CCCCCVC
22 VVCVCVC
22 VCCVCVV
22 CCCCVVC
Eight letters
$ cat SixOrMore|sed -n '/^........$/s/[aeiouAEIOU]/A/gp'|sed 's/[BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz]/C/g;s/A/V/g'|sort|uniq -c|sort -nr|head -50
927 CVCCVCVC
835 CVCVCVCC
742 CCVCCVCC
623 CVCCCVCC
462 CVCVCCVC
422 CVCVCVCV
370 CVCVCVVC
349 CVVCCVCC
332 VCCVCCVC
328 CVCCVCCC
268 VCCVCVCC
261 CCVCCCVC
239 CCVCVCVC
227 CCVVCVCC
192 CVCCVVCC
192 CVCCVCCV
191 CVCCCVCV
182 CVCCCVVC
141 VCCVCVCV
135 VCCCVCVC
132 CVCVVCVC
123 VCVCVCVC
121 VCVCCVCC
120 CVCCCCVC
116 CVVCVCVC
115 CCVCCVVC
111 CCVCCVCV
105 VCCVCVVC
92 CVVCCCVC
90 CCVVCCVC
85 CVVCCVCV
84 CCCVCCVC
77 VCCVVCVC
75 CVVCCVVC
74 VCVCCVCV
67 CVVCVCCC
65 CVCCVVCV
61 VCVCCVVC
60 CCVCVCCC
57 CVCVCCCC
56 CCVCVCCV
52 CVCVVCCV
52 CVCCVCVV
47 CVCVVCCC
43 CVVCVCCV
42 CCCVCVCC
41 VCVCCCVC
38 CCVCVVCC
38 CCCCVCVC
37 CCVVCVCV
Spanish
$echo $s
es.txt
data$pwd
/home/SRUNET/david.dailey/data
Most frequent characters
$cat $s|sed 's/\ .*//;s/./&\n/g'|awk '!/^$/'|sort|uniq -c|sort -nr|head -50
537718 a
454007 e
353327 r
342698 o
336226 i
313406 s
295433 n
224557 t
215427 l
189358 c
165198 d
148208 m
135614 u
101579 p
82573 b
79228 g
69799 h
52981 v
46424 f
37876 k
30150 y
29973 á
29960 z
25592 í
25280 j
24380 é
17352 ó
14592 w
14034 q
10128 x
4989 ñ
4261 ú
2045 ò
1754 à
1732 ô
1588 â
1441 ï
1170 è
885 ü
721 ì
701 ê
591 ソ
466 ã
438 ö
401 ż
396 ą
358 ç
317 î
309 ä
247 û
$grep ソ $s
ソpor 16
ソest疽 16
ソte 12
ソno 12
ソpuedo 8
ソde 8
ソeres 7
ソes 6
ソqui駭 6
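A Google search for ‘ソest疽’ reveals about 5000 hits, including
https://commons.wikimedia.org/wiki/TimedText:The_Million_Ryo_Pot_(1935).webm.ja.srt
Entitled “Japanese subtitles for clip: File:The Million Ryo Pot (1935).webm”, the page has 1219 entries, many of which appear to be Spanish with frequent transcription errors. The errors look like what happens when accented Spanish text is decoded as a Japanese encoding: ‘ソ’ stands where ‘¿’ belongs, and an accented vowel fuses with the letter after it, so that ‘¿estás’ shows up as ‘ソest疽’ and ‘¿quién’ as ‘ソqui駭’. E.g.: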
725
00:52:24,546 --> 00:52:27,276
Es la segunda casa desde la esquina,
delante de un pozo. No tiene p駻dida.
$cat $s|sed 's/\ .*//;s/./&\n/g'|awk '!/^$/'|sort|uniq -c|sort -nr|head -50|awk '{print $2}'|tr '\n' ' '
a e r o i s n t l c d m u p b g h v f k y á z í j é ó w q x ñ ú ò à ô â ï è ü ì ê ソ ã ö ż ą ç î ä û
a$v=[aeoiuáíéóúòàôâïèüìêãöąîäû]
data$c=[rsntlcdmpbghvfkyzjwqxñżç]
Spanish 4:
$awk '{print $1}' $s|sed -n '/^.\{4\}$/s/[aeoiuáíéóúòàôâïèüìêãöąîäû]/V/gp'|sed 's/[rsntlcdmpbghvfkyzjwqxñżç]/C/g'|sort|uniq -c|sort -nr|head -24
6082 CVCV
3453 CVCC
2066 CVVC
1865 CCVC
1520 VCVC
1251 VCCV
568 CCCV
562 CVVV
506 VCVV
498 CCVV
467 VVCV
441 VCCC
165 VVCC
124 VVVC
62 VVVV
14 ソCVC
13 CVCž
9 CVńV
8 ĺźCV
8 CVëC
6 CVCแ
5 CVýV
5 CVCù
5 CVC
English 4
$cat $e|sed 's/\ .*//;s/./&\n/g'|awk '!/^$/'|sort|uniq -c|sort -nr|head -50|awk '{print $2}'|tr '\n' ' '
e a i r o n s t l u c h d m g p b k y f v w z j x q é ÿ í á ï ä ó è ö ñ î þ ã а ü о е и ç å à ý ê т
data$ev="e a i o u é ÿ í á ï ä ó è ö î ã а ü о е и å à ê"
data$ec="r n s t l c h d m g p b k y f v w z j x q ñ þ ç т"
data$echo $ec|sed 's/\ //g'
rnstlchdmgpbkyfvwzjxqñþçт
data$echo $ev|sed 's/\ //g'
eaiouéÿíáïäóèöîãаüоеиåàê
$awk '{print $1}' $e|sed -n '/^.\{4\}$/s/[eaiouéÿíáïäóèöîãаüоеиåàê]/V/gp'|sed 's/[rnstlchdmgpbkyfvwzjxqñþçт]/C/g'|sort|uniq -c|sort -nr|head -24
5235 CVCV
5079 CVCC
2704 CCVC
2500 CVVC
1651 VCVC
1269 VCCV
884 VCCC
881 CCCV
568 CCVV
514 CVVV
437 VVCC
420 VVCV
413 VCVV
172 VVVC
51 VVVV
18 CôCV
14 CVCô
13 ηVCC
12 CVšV
12 CVCò
11 CøCV
10 CâCV
9 CVCú
8 žVCV
Note that for four letter words, in both Spanish and English, CVCV is the top-occurring pattern, while CVCC is second. Note also that when I used the top fifty characters in English, ‘ô’ and ‘ú’ (clearly vowels) didn’t appear in the top fifty, so they pass through the script unclassified. The above script could clearly be refined, but it is interesting to note that the pattern CôCV (18) is more frequent than CCCC, which does not appear among the top 24 patterns in this particular vocabulary of the language. Some of the more frequent occurrences ($grep "^.ô..\ " $e): côte 17 (as in Côte d’Azur), môle 14, côté 13, dôme 10, môme 6, cômo 5, côme 5, rôti 4 (as in poulet rôti - wrapped in bacon, with purée and fennel
(https://www.tripadvisor.co.uk/LocationPhotoDirectLink-g186338-d1388950-i94968576-Cote_Brasserie_Covent_Garden-London_England.html)
); also in familiar appearance: lancôme 3.
French and German (just for fun):
$wc $g $f
317388 634776 4573651 de.txt
305763 611526 3833939 fr.txt
623151 1246302 8407590 total
$cat $f $g|sed 's/\ .*//;s/./&\n/g'|awk '!/^$/'|sort|uniq -c|sort -nr|head -60|awk '{print $2}'|tr '\n' ' '
e r n i a s t l o u h c g m d p b f k v é z w y ä ü j q x è ö ê ß ï í â î ç ô á ž û à ó ì č š å ã ë ο ñ ú œ ę þ ù æ ÿ õ
$echo $fgv
eiasouéyäüèöêïíâîôáûàóìåãëοúœęùæÿõ
$echo $fgc|sed 's/\ //g'
rnstlhcgmdpbfkvzwyjqxßçžčšñþ
French
$awk '{print $1}' $f|sed -n '/^.\{4\}$/s/[eiasouéyäüèöêïíâîôáûàóìåãëοúœęùæÿõ]/V/gp'|sed 's/[rnstlhcgmdpbfkvzwyjqxßçžčšñþ]/C/g'|sort|uniq -c|sort -nr|head -24
4184 CVCV
1720 CVCC
1597 CVVC
1171 CVVV
945 CCVC
903 VCVC
864 VVCV
838 VCCV
680 VCVV
553 CCVV
381 VVVC
304 VVVV
283 CCCV
272 VVCC
165 VCCC
5 CVCò
5 CòCV
5 CˆCV
4 CVďC
3 VCCò
3 CVC嶪
3 CVCŕ
3 CVCø
3 CøCV
German
$awk '{print $1}' $g|sed -n '/^.\{4\}$/s/[eiasouéyäüèöêïíâîôáûàóìåãëοúœęùæÿõ]/V/gp'|sed 's/[rnstlhcgmdpbfkvzwyjqxßçžčšñþ]/C/g'|sort|uniq -c|sort -nr|head -24
2376 CVCV
1793 CVCC
1151 CVVC
729 CCVC
648 CVVV
635 VCVC
498 VCCV
474 VVCV
311 VCVV
291 VVCC
268 CCVV
251 VVVC
176 VCCC
169 VVVV
144 CCCV
4 κVCC
3 ηVCC
2 μVCC
2 ηVVC
2 ηVCV
2 εVCV
2 αVCV
2 αCVV
1 νVVC
Conclusion: It is interesting to note that for these four languages,
the most prevalent Consonant-Vowel rhythms for four-letter
words are, first, CVCV and, second, CVCC.
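The pipelines above differ only in the word list, the word length and the two character sets, so they can be folded into one function. This is a minimal sketch, not the script actually used: the name cv_rhythm and its argument order are my own assumptions. One behavioural difference is deliberate: with sed -n and the p flag on the substitution, a word containing no character from the vowel set is never printed, so all-consonant patterns are dropped by the pipelines above. The sketch filters on length first and therefore keeps them, which may change the counts slightly.

```shell
# cv_rhythm FILE LENGTH VOWELS CONSONANTS [TOP]
# Hypothetical helper: prints the TOP (default 24) most frequent C/V
# patterns among the LENGTH-character words in column 1 of FILE.
cv_rhythm() {
    awk '{print $1}' "$1" |
        # keep only words of exactly LENGTH characters
        sed -n "/^.\{$2\}\$/p" |
        # vowels first, then consonants, as in the pipelines above
        sed "s/[$3]/V/g; s/[$4]/C/g" |
        sort | uniq -c | sort -nr | head -n "${5:-24}"
}
```

With the sets stored in shell variables (say $fgv and $fgc with the spaces removed), a call such as cv_rhythm "$f" 4 "$fgv" "$fgc" would stand in for the French four-letter pipeline.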
English 5
$awk '{print $1}' $e|sed -n
'/^.\{5\}$/s/[eaiouéÿíáïäóèöîãаüоеиåàê]/V/gp'|sed
's/[rnstlchdmgpbkyfvwzjxqñþçт]/C/g'|sort|uniq -c|sort -nr|head -32
10562 CVCVC
8473 CVCCV
4573 CVCCC
3367 CCVCC
2736 CCVCV
2623 CVVCV
2550 VCCVC
2533 CVCVV
2403 CVVCC
1888 VCVCV
1419 CCVVC
1319 CCCVC
1075 VCVCC
583 CVVVC
568 VCVVC
534 VVCVC
512 CCCCV
507 VCCVV
490 VCCCV
337 VCCCC
243 VVCCC
216 VVCCV
214 CCCVV
170 CCVVV
126 VVVCC
109 CVVVV
105 VVCVV
74 VVVCV
67 VCVVV
37 VVVVC
25 VVVVV
17 CVυCC
Spanish 5
$awk '{print $1}' $s|sed -n
'/^.\{5\}$/s/[aeoiuáíéóúòàôâïèüìêãöąîäû]/V/gp'|sed
's/[rsntlcdmpbghvfkyzjwqxñżç]/C/g'|sort|uniq -c|sort -nr|head -32
10524 CVCVC
8301 CVCCV
3180 CVCVV
2962 CVVCV
2911 VCVCV
2743 CCVCV
2540 CVCCC
2080 VCCVC
1702 CCVCC
1247 CVVCC
904 CCVVC
657 CCCVC
578 CVVVC
577 VCCVV
555 VCVVC
451 VCVCC
437 VVCVC
381 VCCCV
339 VVCCV
272 CCCCV
248 CVVVV
149 VVVCV
136 VVCVV
133 CCVVV
132 CCCVV
117 VCCCC
106 VCVVV
77 VVCCC
48 VVVVC
37 VVVVV
29 VVVCC
13 ソCVCV
Note that for five-letter words, in both Spanish and English, CVCVC is
the most frequent pattern, while CVCCV is second.
Spanish 6
$awk '{print $1}' $s|sed -n
'/^.\{6\}$/s/[aeoiuáíéóúòàôâïèüìêãöąîäû]/V/gp'|sed
's/[rsntlcdmpbghvfkyzjwqxñżç]/C/g'|sort|uniq -c|sort -nr|head -32
13257 CVCCVC
13213 CVCVCV
3243 CVCVVC
3210 CCVCVC
3127 CVVCVC
2832 CVCCVV
2777 VCCVCV
2769 CVCVCC
2394 VCVCVC
2038 CCVCCV
1994 CVCCCV
1699 CVVCCV
1671 VCVCCV
910 CCVCCC
780 CCVVCV
767 CCVCVV
750 VCVCVV
730 VCCVCC
676 CVVCVV
647 VCCCVC
605 CVCCCC
596 VCCVVC
579 VCVVCV
509 CCVVCC
468 CVCVVV
432 CCCVCV
414 CVVVCV
396 VVCVCV
375 CCCVCC
343 CVVCCC
338 CCCCVC
270 VVCCVC
English 6
$awk '{print $1}' $e|sed -n
'/^.\{6\}$/s/[eaiouéÿíáïäóèöîãаüоеиåàê]/V/gp'|sed
's/[rnstlchdmgpbkyfvwzjxqñþçт]/C/g'|sort|uniq -c|sort -nr|head -32
15464 CVCCVC
8722 CVCVCV
4935 CVCVCC
3706 CCVCVC
3489 CVVCVC
2623 CVCCVV
2578 CVCCCV
2559 CVCVVC
2333 CCVCCV
1980 CCVCCC
1841 VCCVCC
1696 VCCVCV
1602 CVCCCC
1560 CVVCCV
1454 VCVCVC
1107 CCVVCC
981 CCCVCC
968 VCCCVC
943 CVVCCC
915 VCVCCV
843 CCCCVC
777 CVVCVV
737 CCVVCV
698 VCCVVC
653 CCCVCV
635 CCVCVV
420 VVCCVC
396 CCCVVC
366 VCVCVV
319 VCVCCC
317 VCVVCV
280 VVCVCC
Note that for six-letter words, in both Spanish and English,
CVCCVC is the most frequent pattern, while CVCVCV is
second. However, note some disagreement in lower-ranked
patterns (the number after each parenthesis is the pattern's rank):
English (4935 CVCVCC)3 > (2559 CVCVVC)8
Spanish (2769 CVCVCC)8 < (3243 CVCVVC)3
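The ranks quoted in the comparison above can be read off mechanically instead of by eye. The following is a sketch under the assumption that each ranked list has been saved to a file in the "count pattern" form printed by the pipelines; compare_ranks is a hypothetical name. The demo data are the top eight six-letter lines for English and Spanish copied from the listings above.

```shell
# compare_ranks FILE_A FILE_B
# Hypothetical helper. Both files hold "count pattern" lines, already
# sorted by count (descending). Prints, for every pattern found in both:
#   pattern rankA countA rankB countB
compare_ranks() {
    awk 'NR == FNR { rank[$2] = FNR; count[$2] = $1; next }
         ($2 in rank) { print $2, rank[$2], count[$2], FNR, $1 }' "$1" "$2"
}

en=$(mktemp) es=$(mktemp)
printf '%s\n' '15464 CVCCVC' '8722 CVCVCV' '4935 CVCVCC' '3706 CCVCVC' \
    '3489 CVVCVC' '2623 CVCCVV' '2578 CVCCCV' '2559 CVCVVC' > "$en"
printf '%s\n' '13257 CVCCVC' '13213 CVCVCV' '3243 CVCVVC' '3210 CCVCVC' \
    '3127 CVVCVC' '2832 CVCCVV' '2777 VCCVCV' '2769 CVCVCC' > "$es"
compare_ranks "$en" "$es" | grep -e '^CVCVCC ' -e '^CVCVVC '
# prints:
# CVCVVC 8 2559 3 3243
# CVCVCC 3 4935 8 2769
rm -f "$en" "$es"
```

The output follows the order of the second file, so patterns whose two rank columns differ most stand out when the result is scanned top to bottom.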
Consonant Vowel rhythms in
English, Spanish, French and German vocabulary
English 7
$awk '{print $1}' $e|sed -n
'/^.\{7\}$/s/[eaiouéÿíáïäóèöîãаüоеиåàê]/V/gp'|sed
's/[rnstlchdmgpbkyfvwzjxqñþçт]/C/g'|sort|uniq -c|sort -nr|head -32
8710 CVCCVCC
6206 CVCCVCV
6140 CVCVCVC
5013 CVCCCVC
4789 CCVCCVC
4524 CVCVCCV
3145 CVCCVVC
2297 CVVCCVC
1789 CVVCVCC
1742 CCVCVCV
1707 CCVCVCC
1694 VCCVCVC
1463 CVCVCVV
1288 CCVVCVC
1242 CVCVVCV
1152 CVCVCCC
1091 VCVCVCV
1016 CVVCVCV
869 VCVCCVC
839 VCCVCCV
824 CVCVVCC
753 CCVCVVC
737 VCCCVCC
716 CCVCCCV
676 CVVCVVC
669 CCCVCVC
589 CVCCCCC
586 CCVCCVV
578 CCVCCCC
576 VCVCVCC
557 CVCCCVV
556 CCCCVCC
Spanish 7
$awk '{print $1}' $s|sed -n
'/^.\{7\}$/s/[aeoiuáíéóúòàôâïèüìêãöąîäû]/V/gp'|sed
's/[rsntlcdmpbghvfkyzjwqxñżç]/C/g'|sort|uniq -c|sort -nr|head -32
10517 CVCVCVC
9802 CVCCVCV
7033 CVCVCCV
4554 CVCCVCC
3273 CVCCCVC
3047 CCVCCVC
2997 VCVCVCV
2932 CVCCVVC
2729 CCVCVCV
2579 VCCVCVC
2540 CVCVCVV
2328 CVCVVCV
1860 CVVCVCV
1805 CVVCCVC
1755 VCCVCCV
1225 VCVCCVC
845 CCVVCVC
763 VCCVCVV
762 CCVCVCC
741 CCVCVVC
735 CVVCVCC
713 VCVCVVC
642 VCCCVCV
608 VCCVVCV
564 CVVCVVC
512 CVCVCCC
502 CCVCCVV
469 CCVCCCV
447 VCVVCVC
442 CVCCCVV
439 CVCVVVC
380 CCCVCVC
Let’s also look at French and German:
French 7
$awk '{print $1}' $f|sed -n
'/^.\{7\}$/s/[eiasouéyäüèöêïíâîôáûàóìåãëοúœęùæÿõ]/V/gp'|sed
's/[rnstlhcgmdpbfkvzwyjqxßçžčšñþ]/C/g'|sort|uniq -c|sort -nr|head -24
3774 CVCCVCV
2838 CVCVCCV
2692 CVCVCVV
2300 CVCVCVC
1876 CVCCVCC
1874 CVCVVCV
1486 CVCCVVC
1276 CVVCVCV
1136 CVCCVVV
1080 CVCCCVC
1017 VCVCVCV
1013 CCVCVCV
931 VCCVCVV
915 CVVCCVC
898 CCVCCVC
893 VCCVCCV
804 CVVCCVV
754 CVCVVVV
680 VCVCCVC
667 CCVCCVV
663 CVCCCVV
642 VCCVCVC
630 CVVCVVC
611 CVVCVCC
German 7
$awk '{print $1}' $g|sed -n
'/^.\{7\}$/s/[eiasouéyäüèöêïíâîôáûàóìåãëοúœęùæÿõ]/V/gp'|sed
's/[rnstlhcgmdpbfkvzwyjqxßçžčšñþ]/C/g'|sort|uniq -c|sort -nr|head -24
2214 CVCCVCV
2166 CVCCVCC
1577 CVCVCVC
1304 CVCCCVC
1234 CVCVCCV
967 CCVCCVC
911 CVCCVVC
846 CVVCCVC
762 CVCVCVV
686 VCCVCVC
665 VCVCCVC
650 CVVCVCV
642 CVCVVCV
620 CVVCVCC
565 CVCCVVV
564 VCCVCCV
461 CVCVCCC
453 CCVVCVC
393 VCVCVCV
393 CVCVVCC
390 VCCCVCC
376 CVVVCVC
367 CCVCVCV
354 CVCCCVV
Note that in English (8710 CVCCVCC)1 > (6140 CVCVCVC)3
While in Spanish (4554 CVCCVCC)4 < (10517 CVCVCVC)1
In French (1876 CVCCVCC)5 < (2300 CVCVCVC)4
And in German (2166 CVCCVCC)2 > (1577 CVCVCVC)3
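Since the four word lists differ in size (the wc output above gives 317388 lines for de.txt and 305763 for fr.txt), raw counts are not directly comparable across languages. A rough way to put them on one scale is to divide each count by the total number of words of that length. The sketch below assumes it is fed the complete "count pattern" list, that is, the pipeline output before head truncates it; add_share is a hypothetical name.

```shell
# add_share: reads "count pattern" lines on stdin and prints
#   pattern count percent-of-total
# Hypothetical helper; the total is taken over the lines it receives, so
# it should be given the full list, not the output of head.
add_share() {
    LC_ALL=C awk '{ count[NR] = $1; pat[NR] = $2; total += $1 }
         END {
             for (i = 1; i <= NR; i++)
                 printf "%s %d %.1f\n", pat[i], count[i], 100 * count[i] / total
         }'
}

printf '6 CVCV\n3 CVCC\n1 VCVC\n' | add_share
# prints:
# CVCV 6 60.0
# CVCC 3 30.0
# VCVC 1 10.0
```

Inserted between sort -nr and head in any of the pipelines, this would let the percentage column be compared across languages directly.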
Examples:
CVC rhythm  English          Spanish
CVCCVCC     forward/selling  raymond/bistecs
CVCCVCV     destiny/lottery  soldado/cerrado
CVCVCVC     related/titanic  sigamos/pedimos
CVCCCVC     matches/seltzer  mostrar/manchas
CCVCCVC     stalled/bracket  francos/prestar
CVCVCCV     bizarre/syringe  podréis/cambios
CVCCVVC     passion/penguin  viernes/sientas
CVVCCVC     neither/measles  cierren/cuernos
VCVCVCV     ability/episode  apetece/editado
CVCVCVV     someday/referee  refería/delicia
CVCVVCV     genuine/release  líquido/valiosa
CVVCVCV     sausage/seizure  realeza/quemado
CVCCVVV     kumbaya/hawkeye  desmayó/turquía
CCVCVCV     closely/precise  llamaba/trasera
VCCVCVV     amnesia/antique  odiaría/acuario
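Example words for a given rhythm can be pulled from a word list with the same substitutions. This is only a sketch of one way to do it: words_with_rhythm and its arguments are hypothetical names. Whether y counts as a vowel is decided purely by the sets passed in, and some English entries in the table (destiny, closely) only fit their row when y is treated as a vowel.

```shell
# words_with_rhythm FILE PATTERN VOWELS CONSONANTS [N]
# Hypothetical helper: prints the first N (default 2) words in column 1
# of FILE whose C/V pattern equals PATTERN.
words_with_rhythm() {
    awk '{print $1}' "$1" |
        # h keeps a copy of the word; after the substitutions G appends it
        # back, giving "PATTERN word" on one line
        sed "h; s/[$3]/V/g; s/[$4]/C/g; G; s/\n/ /" |
        awk -v p="$2" -v n="${5:-2}" '$1 == p { print $2; if (++k >= n) exit }'
}
```

Because the word list is read in file order, the words returned are simply the first matches; if the list is sorted by frequency they are the most frequent words with that rhythm.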
Seven letter sequences: comparisons of consonant-vowel rhythms across
English, Spanish, French and German:
Relative frequency for most popular CVC sequences relative to the total
number sampled.
The table below was built by first choosing the eight most frequently
occurring sequences in English, and then “bootstrapping” outward so
that each language’s highest-frequency entries were included.
        CVCCVCC CVCCVCV CVCVCVC CVCCCVC CCVCCVC CVCVCCV CVCCVVC CVVCCVC VCVCVCV CVCVCVV CVCVVCV CVVCVCV CVCCVVV CCVCVCV VCCVCVV
English    8710    6206    6140    5013    4789    4524    3145    2297    1454    1463    1241    1016     222    1742     366
Spanish    4554    9802   10517    3273    3047    7033    2932    1805    2997    2540    2328    1860     310    2729     734
French     1876    3774    2309    1080    1486    2838    1486     915    1017    2692    1874    1276    1136    1013     931
German     2166    2214    1577    1304     967    1234     911     846     393     762     642     650     565     367     704
Specifically, in order to “get to” the eight highest sequences for
English (the eighth, CVVCCVC, occurs 2297 times in English but only
1805 times in Spanish), the Spanish sequences with a frequency higher
than the Spanish value of this pattern, 1805, had to be considered
first. Namely, the sequences (VCVCVCV: 2997, CVCVCVV: 2540,
CVCVVCV: 2328, CVVCVCV: 1860) all had to be considered before inclusion
of CVVCCVC could be entertained. This method was extended until all
four languages had their top eight values represented in the
table. This required the addition of seven more columns, as can be seen.
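A simpler, mechanical variant of this idea is to take the union of each language's top N patterns and then look up every language's count for each pattern in that union. It is not exactly the procedure described above (which starts from English and extends outward), so the resulting set of columns may differ. The sketch assumes one complete "count pattern" file per language, sorted by count; top_union and pattern_table are hypothetical names.

```shell
# top_union N FILE...
# Patterns that occur in the first N lines of at least one FILE, in order
# of first appearance.
top_union() {
    n=$1; shift
    for f in "$@"; do head -n "$n" "$f"; done | awk '!seen[$2]++ { print $2 }'
}

# pattern_table N FILE...
# One row per pattern from top_union, followed by one count per FILE
# ("-" when the pattern is absent from that file).
pattern_table() {
    n=$1; shift
    union=$(mktemp)
    top_union "$n" "$@" > "$union"
    awk 'FNR == 1 { nfile++ }
         nfile == 1 { order[++m] = $1; next }   # first file: the union
         { count[nfile, $2] = $1 }              # later files: the counts
         END {
             for (i = 1; i <= m; i++) {
                 line = order[i]
                 for (j = 2; j <= nfile; j++)
                     line = line " " (((j, order[i]) in count) ? count[j, order[i]] : "-")
                 print line
             }
         }' "$union" "$@"
    rm -f "$union"
}
```

The output has patterns as rows and languages as columns, in the order the files are given; it assumes no input file is empty, since the awk script counts files by their first line.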
Spanish 9
$awk '{print $1}' $s|sed -n
'/^.\{9\}$/s/[aeoiuáíéóúòàôâïèüìêãöąîäû]/V/gp'|sed
's/[rsntlcdmpbghvfkyzjwqxñżç]/C/g'|sort|uniq -c|sort -nr|head -32
6006 CVCVCVCVC
5452 CVCCVCVCV
4164 CVCVCCVCV
3651 CVCVCVCCV
3380 CVCCVCCVC
2702 VCCVCVCVC
1961 VCCVCCVCV
1868 VCCVCVCCV
1808 CVCCVCVVC
1420 CCVCVCVCV
1355 VCVCCVCVC
1297 CVCCVVCVC
1218 CVCCCVCVC
1136 VCVCVCVCV
1039 CVCVCVCVV
1002 CVCVCCVVC
989 CVCVCVVCV
946 CVCCVVCCV
939 CCVCCVCVC
933 VCVCCVCCV
741 CCVCVCCVC
740 VCCCVCVCV
734 CVCCCVCCV
705 CVCCVCVCC
644 CVVCVCVCV
627 VCVCVCCVC
612 CVCVVCVCV
605 CCVCCVCCV
602 CVVCCVCVC
551 CVCVVCCVC
543 CVCVCCVCC
533 CCVCVCVVC
English 9
$awk '{print $1}' $e|sed -n
'/^.\{9\}$/s/[eaiouéÿíáïäóèöîãаüоеиåàê]/V/gp'|sed
's/[rnstlchdmgpbkyfvwzjxqñþçт]/C/g'|sort|uniq -c|sort -nr|head -32
3240 CVCCVCCVC
2342 CVCCVCVCV
2295 CVCCVCVCC
1787 CVCVCVCVC
1571 CVCVCCVCC
1454 CVCVCCVCV
1428 CVCVCVCCV
1218 CVCCCVCVC
1156 CVCCVCVVC
1047 CCVCCCVCC
921 CCVCCVCVC
884 CVCCCCVCC
859 VCCVCCVCC
738 CVCCCVCCC
733 CCVCVCVCV
715 CCVCVCVCC
712 CCVCVCCVC
660 CVCVCCVVC
616 CVCCVVCVC
612 VCCVCVCVC
600 CVCVCCCVC
558 CVCCCVCCV
525 CVCVCVCCC
517 CCVVCCVCC
512 CVVCCCVCC
501 VCCVCCVCV
494 CVVCCVCVC
482 CVCCVCCCV
482 CCVCCVCCV
473 CVCCCVVCC
460 CCVCCVCCC
424 CVCCVCCCC
English 9 (from a different dictionary)
$ cat SixOrMore|sed -n '/^.\{9\}$/s/[aeiouAEIOU]/A/gp'|sed
's/[BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz]/C/g;s/A/V/g'|sort|uniq
-c|sort -nr|head -32
640 CVCCVCVCC
403 CVCCVCCVC
381 CVCVCCVCC
322 CVCCVCVCV
314 CVCVCVCVC
314 CVCCVCVVC
267 VCCVCCVCC
200 CCVCCCVCC
184 CCVCVCVCC
170 CVCVCCVCV
164 CCVCCVCVC
156 VCCVCVCVC
151 CVCCCVCVC
130 CVCCCCVCC
129 CVCVCVCCC
124 CVCVCCVVC
122 VCCVCCVCV
119 CCVVCCVCC
110 CVCCVVCVC
106 CVVCCCVCC
103 CVCVVCVCC
98 CVCVCVCCV
98 CCVCVCCVC
97 CCVCVCVVC
90 CVVCVCVCC
88 VCCVCCVVC
87 CCVCVCVCV
85 VCVCVCVCC
82 VCCCVCVCC
82 CVCVCVVCC
80 VCCCVCCVC
78 CVCCCVCCC
Note that English (3240 CVCCVCCVC)1 > (1787 CVCVCVCVC)4
(generally consistent across both methods)
While Spanish (3380 CVCCVCCVC)5 < (6006 CVCVCVCVC)1