Guides

In Kitconc, the ‘Corpus’ class is the main object for creating and manipulating corpora. Its constructor has the following form:

corpus = Corpus(workspace, corpus_name, language)

To instantiate a Corpus object, pass values for the following arguments:

  • workspace - [string] - a directory where the corpus is created and its processing files are stored.
  • corpus_name - [string] - a name used to identify and access the corpus.
  • language - [string] - the language of the texts, used to select the linguistic resources needed for processing.

The constructor returns a reference to the corpus. The texts are only processed when the ‘add_texts()’ function is executed.

corpus.add_texts(source_folder,**kwargs)

Arguments:

  • source_folder - [string] - path of the directory that contains the text files for the corpus;
  • tagged - [boolean] - indicates whether the texts are already tagged;
  • show_progress - [boolean] - indicates whether processing progress is displayed on the console. The default value is ‘False’.

The Corpus object provides the methods for processing the texts (wordlist(), keywords(), kwic(), etc.).

See the examples in the following sections for a better understanding.

Downloading Examples

from kitconc.core import Examples
# download the example files used in the following sections
Examples().download()

Creating a corpus

from kitconc.kit_corpus import Corpus 
# reference to the corpus
corpus = Corpus('kitconc-examples/workspace','ads','english')
# add texts from source folder
corpus.add_texts('kitconc-examples/ads',show_progress=True)

Creating a wordlist

from kitconc.kit_corpus import Corpus 
# reference to the corpus
corpus = Corpus('kitconc-examples/workspace','ads','english')
# make wordlist
wordlist = corpus.wordlist(show_progress=True)
# print the top 10 
print(wordlist.df.head(10))
# save Excel file
wordlist.save_excel(corpus.output_path + 'wordlist.xlsx') 
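
To make clear what a wordlist holds, the sketch below builds a frequency list in plain Python, without Kitconc. It is only an illustration: the function name simple_wordlist, the sample sentences and the regular-expression tokenizer are assumptions, and Kitconc's own tokenization and output columns may differ.

```python
import re
from collections import Counter

def simple_wordlist(texts):
    """Count lowercase word tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # assumption: a token is a run of letters or apostrophes
        counts.update(re.findall(r"[a-z']+", text.lower()))
    # most_common() returns (word, frequency) pairs, most frequent first
    return counts.most_common()

# hypothetical sample texts, not taken from the Kitconc examples
sample = ["Experience required. Sales experience preferred.",
          "Good salary for the right experience."]
for word, freq in simple_wordlist(sample)[:3]:
    print(word, freq)
```

The real wordlist is exposed as a table through wordlist.df, so it can be inspected with head() as shown above.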

Extracting keywords

from kitconc.kit_corpus import Corpus
# reference to the corpus
corpus = Corpus('kitconc-examples/workspace','ads','english')
# make the wordlist that keywords() takes as input
wordlist = corpus.wordlist(show_progress=True)
# extract keywords
keywords = corpus.keywords(wordlist,show_progress=True)
# print the top 10
print(keywords.df.head(10))
# save Excel file
keywords.save_excel(corpus.output_path + 'keywords.xlsx')
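
Keyword extraction compares word frequencies in the corpus against a reference. One statistic commonly used for this in corpus linguistics is log-likelihood, sketched below in plain Python. This guide does not state which statistic or reference list Kitconc uses, so the function log_likelihood and its parameter names are assumptions for illustration only.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of a word: study corpus vs. reference corpus."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # expected frequencies if the word were equally common in both corpora
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    # a zero observed frequency contributes nothing to the sum
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# hypothetical counts: 50 hits in a 1,000-word study corpus,
# 5 hits in a 1,000-word reference corpus
print(round(log_likelihood(50, 1000, 5, 1000), 2))
```

A word with the same relative frequency in both corpora scores zero; the score grows as the two relative frequencies diverge.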

Creating concordance lines - KWIC

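from kitconc.kit_corpus import Corpus
# reference to the corpus
corpus = Corpus('kitconc-examples/workspace','ads','english')
# search for the node word
kwic = corpus.kwic('experience',show_progress=True)
# sort by the first, second and third positions to the right of the node
kwic.sort('R1','R2','R3')
# print the first 10 lines
print(kwic.df.head(10))
# save Excel file, highlighting the sorted positions
kwic.save_excel(corpus.output_path + 'kwic.xlsx',highlight='R1 R2 R3')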
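
The idea behind KWIC lines and the 'R1', 'R2', 'R3' sort can be sketched in plain Python as follows. The functions kwic_lines and sort_by_right, the horizon parameter and the sample sentence are assumptions made for this illustration; they are not part of Kitconc.

```python
def kwic_lines(tokens, node, horizon=3):
    """Return (left, node, right) tuples for each occurrence of node."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = tokens[max(0, i - horizon):i]
            right = tokens[i + 1:i + 1 + horizon]
            lines.append((left, token, right))
    return lines

def sort_by_right(lines):
    """Sort lines by R1, then R2, then R3 (the words right of the node)."""
    return sorted(lines, key=lambda line: [w.lower() for w in line[2]])

# hypothetical sample text
tokens = "sales experience required and teaching experience preferred".split()
for left, node, right in sort_by_right(kwic_lines(tokens, "experience")):
    print(" ".join(left).rjust(30), node, " ".join(right))
```

Sorting on the right context groups lines that share the same continuation, which makes recurring patterns after the node word easier to spot.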

Creating concordance lines - sentences

from kitconc.kit_corpus import Corpus
# reference to the corpus
corpus = Corpus('kitconc-examples/workspace','ads','english')
# search for the node word
concordances = corpus.concordance('experience',show_progress=True)
# print the first 10 lines
print(concordances.df.head(10))
# save Excel file
concordances.save_excel(corpus.output_path + 'concordances.xlsx',highlight='R1 R2 R3')

Finding collocates

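from kitconc.kit_corpus import Corpus
# reference to the corpus
corpus = Corpus('kitconc-examples/workspace','ads','english')
# find collocates within a span of 2 words on each side, filtered by the POS tags in coll_pos
collocates = corpus.collocates('experience',left_span=2,right_span=2,coll_pos='IN NN JJ VBN VBD',show_progress=True)
# print the top 10
print(collocates.df.head(10))
# save Excel file
collocates.save_excel(corpus.output_path + 'collocates.xlsx')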
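
The effect of left_span and right_span can be illustrated with a small window count in plain Python. This is a sketch under assumptions: window_collocates and the sample tokens are made up for the illustration, part-of-speech filtering (coll_pos) is left out, and Kitconc may also compute association measures that are not shown here.

```python
from collections import Counter

def window_collocates(tokens, node, left_span=2, right_span=2):
    """Count words occurring within the given span around each node hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # words to the left of the node, up to left_span positions
            counts.update(tokens[max(0, i - left_span):i])
            # words to the right of the node, up to right_span positions
            counts.update(tokens[i + 1:i + 1 + right_span])
    return counts

# hypothetical sample text
tokens = "proven sales experience in retail and relevant experience in sales".split()
print(window_collocates(tokens, "experience").most_common(3))
```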

Making clusters

from kitconc.kit_corpus import Corpus
# reference to the corpus
corpus = Corpus('kitconc-examples/workspace','ads','english')
# make clusters of size 3 for the node word
clusters = corpus.clusters('experience',size=3,show_progress=True)
# print the top 10
print(clusters.df.head(10))
# save Excel file
clusters.save_excel(corpus.output_path + 'clusters.xlsx')

Making n-grams

from kitconc.kit_corpus import Corpus
# reference to the corpus
corpus = Corpus('kitconc-examples/workspace','ads','english')
# make 3-grams filtered by the POS pattern 'NN IN NN'
ngrams = corpus.ngrams(size=3,pos='NN IN NN',show_progress=True)
# print the top 10
print(ngrams.df.head(10))
# save Excel file
ngrams.save_excel(corpus.output_path + 'ngrams.xlsx')
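
As a rough illustration of what n-grams and clusters count, the plain-Python sketch below slides a window of a given size over a token list. The helper names ngram_counts and cluster_counts are hypothetical, POS filtering is omitted, and treating a cluster as an n-gram that contains the search word is an assumption about the usual corpus-linguistics meaning, not a statement about Kitconc's implementation.

```python
from collections import Counter

def ngram_counts(tokens, size=3):
    """Count every contiguous sequence of `size` tokens."""
    # zip over shifted copies of the list yields overlapping windows
    grams = zip(*(tokens[i:] for i in range(size)))
    return Counter(grams)

def cluster_counts(tokens, node, size=3):
    """Keep only the n-grams that contain the node word (assumed meaning of 'cluster')."""
    return Counter({g: c for g, c in ngram_counts(tokens, size).items() if node in g})

# hypothetical sample text
tokens = "years of experience in sales and years of experience in retail".split()
print(ngram_counts(tokens).most_common(2))
print(sorted(cluster_counts(tokens, "sales")))
```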

Creating dispersion plots

from kitconc.kit_corpus import Corpus
# reference to the corpus
corpus = Corpus('kitconc-examples/workspace','ads','english')
# compute the dispersion of the word
dispersion = corpus.dispersion('salary')
# print the first 10 rows
print(dispersion.df.head(10))
# save Excel file
dispersion.save_excel(corpus.output_path + 'dispersion.xlsx')
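
A dispersion plot shows where in a text a word occurs. The sketch below computes the underlying data in plain Python: the relative position of each hit and the number of hits per segment of the text. The functions dispersion_points and segment_hits, the choice of five segments and the sample text are assumptions for illustration; the columns Kitconc returns in dispersion.df may be organized differently.

```python
def dispersion_points(tokens, word):
    """Relative positions (0.0 up to, but not including, 1.0) of each hit."""
    n = len(tokens)
    return [i / n for i, token in enumerate(tokens) if token == word]

def segment_hits(tokens, word, segments=5):
    """Number of hits in each of `segments` equal slices of the text."""
    hits = [0] * segments
    for point in dispersion_points(tokens, word):
        hits[min(int(point * segments), segments - 1)] += 1
    return hits

# hypothetical sample text
tokens = ("salary negotiable . good salary and benefits . "
          "competitive salary offered").split()
print(segment_hits(tokens, "salary"))
```

A word whose hits fall into many segments is spread through the text, while hits concentrated in one segment suggest a local topic.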

Creating keywords dispersion plots

from kitconc.kit_corpus import Corpus
# reference to the corpus
corpus = Corpus('kitconc-examples/workspace','ads','english')
# make the wordlist and extract keywords from it
wordlist = corpus.wordlist(show_progress=True)
keywords = corpus.keywords(wordlist,show_progress=True)
# compute the dispersion of the keywords
keywords_dispersion = corpus.keywords_dispersion(keywords,show_progress=True)
# print the first 10 rows
print(keywords_dispersion.df.head(10))
# save Excel file
keywords_dispersion.save_excel(corpus.output_path + 'keywords_dispersion.xlsx')

Finding collocations

from kitconc.kit_corpus import Corpus
# reference to the corpus
corpus = Corpus('kitconc-examples/workspace','ads','english')
# make KWIC lines for the node word
kwic = corpus.kwic('skills',show_progress=True)
# find collocations from the KWIC lines
collocations = corpus.collocations(kwic,show_progress=True)
# print the top 10
print(collocations.df.head(10))
# save Excel file
collocations.save_excel(corpus.output_path + 'collocations.xlsx')
# plot a collocate distribution
collocations.plot_colldist('strong')

Plotting collocates

from kitconc.kit_corpus import Corpus
# reference to the corpus
corpus = Corpus('kitconc-examples/workspace','ads','english')
# find collocates within a span of 3 words on each side, filtered by the POS tags in coll_pos
collocates = corpus.collocates('skills',left_span=3,right_span=3,coll_pos='NN JJ',show_progress=True)
# print the top 10
print(collocates.df.head(10))
# save Excel file
collocates.save_excel(corpus.output_path + 'collocates.xlsx')
# plot collocates
collocates.plot_collgraph(node='skills')