Machine Translation from English to Pashto using Predicate Logic


Session 2014-2018


Machine translation (MT) is the translation of a given text by a computer system without human intervention, and it is a subfield of natural language processing (NLP). Natural languages are ambiguous, which makes such automation difficult. MT is the process by which software translates a source language into a defined target language. The target translation should have at least the same meaning and sense as the source text. English is one of the world's most widely used languages and is well defined in linguistics compared with Pushto, so a translator from English to Pushto would be a very useful natural language processing application.

The system we have developed uses Rule Based Machine Translation (RBMT). In the RBMT approach, a given sentence is translated by means of analogy. A dictionary of the words of both the source and target languages is used in this approach and is accessible to the translation algorithm. The source language passes through several phases. In the first step, input text is provided by the user or read from a file. The text is then tokenized into words, and a part-of-speech tag is assigned to every word. We have developed grammar rules for the algorithm that performs the translation. The job of the algorithm is to convert English sentences into Pushto sentences in such a way that the meaning and sense of both sentences remain the same.

Table of contents

Declaration
Abstract
1.1 Artificial Intelligence AI
1.2 Natural Language Processing
1.4 Machine Translation Methodologies
1.5 Approaches to Machine Translation
1.5.1 Rule Based Machine Translation (RBMT)
1.5.2 Dictionary Based Machine Translation
1.5.3 Statistical Machine Translation (SMT)
1.6 Automata Theory
1.6.1 Our Motivation
2.1 Parts of speech in English Language
2.2.2 Pronoun
2.2.3 Verb
2.2.4 Adjective
2.2.5 Adverb
2.2.6 Preposition
2.2.7 Conjunction
2.2.8 Determiners
2.4 Parts of Speech in Pashto
2.5 Part of Speech for Pushto
Chapter 3    Patterns and Verifications of Phrases
3.1 Chunking
3.2 Phrases
3.3 Regular Expressions RE
3.4 Deterministic Finite Automata DFA
3.5 Noun Phrase NP
3.6 Pronoun Phrase
3.7 Preposition Noun Phrase
3.9 Predefined Phrase
Chapter 4    Proposed Representation and Mapping of Pashto and English Text
4.1 Predicate Logic PL
4.2 Mapping of Pushto to English
4.3 Swapping
4.4 Pushto Language representation in PL
4.4.1 Mapping of chunks to English
4.5 Algorithms for proposed work
4.5.1 Algorithm for English Logical representation
4.5.4 Algorithm for Retrieval of English Text
4.5.5 Algorithm for Retrieval of English Text
Chapter 5    System Translation
5.1 Taking input from a user
5.3 Tokenization
5.5 PL representation for English language
5.7 PL representation for Pushto language
6.1 Conclusion
6.2 Future work

Chapter 1    Introduction

1.1 Artificial Intelligence AI

Artificial intelligence (AI) is a branch of computer science concerned with the development of intelligent machines that can perform like human beings. The term refers to the simulation of intelligence in machines running software; AI machines are programmed to act like a human being. AI deals with the development of machines that can perform complex functions and can make judgments on the basis of given data in a specific domain [1]. AI plays a very important role in machine learning and in making machines functional. Knowledge representation has made it easier for machines to reason and learn without the need for external data [2]. The processing of natural languages by machines is one application of AI.

1.2 Natural Language Processing

Natural language processing (NLP) is the ability of a computer to understand human language, known as natural language. NLP is the branch of artificial intelligence that helps computers understand natural human language and communicate with people in their own languages. NLP is used to provide real-life communication between computers and individuals, and to translate one defined language into another [3]. It makes people comfortable when talking to machines, and in the same way machines are able to answer human beings in a natural way. Computers work on binary numbers, but they give answers based on some defined logic.

1.3 Machine Translation

Machine translation is the automatic translation of a given text from a source language to a target language using specified rules. Translation from source to target seems to be an easy task, because word-by-word pair dictionaries that can translate individual words already exist, but in reality it is difficult to translate the text of one natural language into another. Machine translation uses rules and algorithms to translate a text in one language into its equivalent in another. Translation is not done word by word; it requires considerable expertise in the grammar and rules of both the source and target languages [4].

Machine translation uses four types of knowledge for translation. The first is knowledge of the language system [5]. Most translation systems use the grammatical competence of both the source and target languages. Knowledge of the language system consists of the underlying assumptions that arise from the rich history of thousands of years of a language's usage [6]. It depends on the basic structures used to build the language, that is, the rules and grammar that make up the language. Knowledge of the word and knowledge of the situation consist of the contextual knowledge that is mostly used in modern AI.

1.4 Machine Translation Methodologies

There are three top-level categories of machine translation methodology: direct, interlingual and transfer. All three are used to find the closest equivalent meaning in the target language, on the basis of which the source language is translated [7]. Direct translation is word-for-word translation, for example using a dictionary; it may involve looking up word pairs in the dictionary. Linguistic representation and grammatical structure are not needed to implement this type of methodology. Semantic transfer uses a representation of the source language that consists of a series of structures having a direct meaning in the target language [8]. The transfer methodology is of two types, semantic and syntactic. In syntactic transfer, the sentences are parsed, every word is represented by its part of speech, and the source language is mapped to the target language using specified transfer rules [9].

1.5 Approaches to Machine Translation

Many approaches can be used for the task of translation. “The Actual translation is to achieving chunking base or dynamic equivalence between the source and target language still appears and needs to be a farfetched dream for computer linguists” [10]. Here we discuss only the most common approaches, which are described in detail below.

1.5.1 Rule Based Machine Translation (RBMT)

Rule based MT (RBMT) systems are based on specified rules for syntax and lexical selection. A set of defined rules and a bilingual lexicon are used in RBMT to translate a source language into a target language by applying morphological, syntactic and semantic analysis. RBMT applies linguistic rules in three phases: analysis, transfer and generation [11]. RBMT systems use the transfer, direct and interlingual approaches to machine translation.

1.5.2 Dictionary Based Machine Translation

In dictionary based machine translation, a dictionary is used for word-by-word translation without any correlation between the semantics of the source and target languages. Dictionary based machine translation involves the following components: a dictionary, a morphological analyzer, and transliteration. It is based on dictionary entries, so that words are translated into other specified words. This approach often works with monolingual dictionaries [12].

1.5.3 Statistical Machine Translation (SMT)

In statistical machine translation, the translation of text from one natural language to another by a computer system is carried out using statistical models. The approach relies on bilingual corpora: analysis of bilingual text corpora and monolingual corpora generates statistical models that transform text from one language to another [13].

1.5.4 Rule Based Machine Translation (RBMT)

Rule based machine translation (RBMT) uses a bilingual corpus with parallel texts, like SMT, or a point-to-point mapping of words, as its main knowledge base at run time [14]. The rule database is consulted at run time. RBMT can work even with a small amount of data. It is designed to translate with the help of previously used rules rather than by adapting new linguistic rules. RBMT is used by many researchers to train and check algorithms.

1.6 Automata Theory

Automata theory is the study of abstract machines and of the computational problems that can be solved using them. It lies at the intersection of computer science and discrete mathematics [15]. Automata theory uses mathematical models to perform defined automatic operations. It has become a backbone of modern computing; its applications include computer languages and compilers [16].

1.6.1 Our Motivation

Our target was to develop a machine translation system from English to Pushto based on rule based machine translation. We have developed a tagging algorithm for the system. Our focus was to build regular expression rules that divide the text into phrases automatically, and to develop an algorithm that can translate input text from English to Pushto. As Pushto is written and spoken by Pushtoons all over the world, we did our best to develop an English to Pushto translator for communication.

Chapter 2    Background

This chapter gives the background on the characteristics of the two languages, English and Pushto: their parts of speech, and how we tagged each part of speech in the given sentences for our translation.

2.1 Parts of speech in English Language

All the words in the English language are divided into eight parts of speech, as shown in Figure 2.1. The parts of speech such as noun, verb, pronoun and adverb also have subcategories: a noun may be singular or common, and a pronoun may be a personal pronoun or a possessive pronoun. To develop a POS tagger, a tag set for the target language is required. Parts of speech are used to build a sentence by joining words so that it is grammatically correct, easily understandable and meaningful [17]. The parts of speech of English and Pushto are the most important part of our work domain, and they are discussed in this chapter. Identifying parts of speech plays an important role in machine translation, since it is used to break up the words and to construct rules and grammar from them. The tokenization of all words in a sentence depends on these parts of speech.

 Figure 2.1

2.1.1 Noun

A noun is the name of a person such as Wasim, a place such as Mingora, a thing or an animal. Because the collection of nouns is very large, it is more valuable to study a noun not only by its simple meaning but also by how other nouns are generated from it. Most nouns form the plural simply by adding ‘s’ or ‘es’. A noun may be common or proper. Numbers are mostly treated as nouns [18], e.g. Wasim bought 12 balls; he was born in 1990.

2.2.2 Pronoun

A pronoun is used in place of a noun. For example: Sohail is a hard worker. He studies all the time. Here ‘he’ is a pronoun and refers to ‘Sohail’, which is a noun. Other pronouns are: it, they, their [19].

2.2.3 Verb

A verb expresses the action, or sometimes the state, of a noun in a sentence. “A verb or compound verb actually declares something about the subject of the sentence and express actions or states of being.” A verb is the key part of a sentence; without it a sentence cannot be complete or give a complete sense [20]. E.g. in “it is sunny”, ‘is’ is the verb.

2.2.4 Adjective

An adjective modifies or gives more information about a noun or pronoun [21]. An adjective usually comes before the noun or pronoun whose meaning it changes. It expresses the quality and value of the word it describes. E.g. Anmol is an attractive girl. Here ‘attractive’ is an adjective.

2.2.5 Adverb

An adverb alters the meaning of a verb, an adjective or another adverb. It conveys manner, time, place or degree, answering questions such as how, when, where or how often. Some adverbs are: now, yesterday, etc.

2.2.6 Preposition

“A preposition is a word used to combine the nouns, pronouns, or phrases to another words inside a sentence.” A preposition typically comes before a noun or pronoun and indicates the connection of that noun or pronoun with the other words of the sentence. Prepositions define the position, time or relation of something. For example, in “the pin is on the table”, ‘on’ is a preposition.

2.2.7 Conjunction

A conjunction is used to join words, phrases, clauses and sentences. Conjunctions join nouns, pronouns, verbs, adjectives, prepositions and adverbs. For example: it is raining but Sohail and Anmol are still running in it. Here ‘and’ and ‘but’ are conjunctions.

2.2.8 Determiners

Determiners are placed before a noun or noun phrase to show what the speaker is referring to; they can also show the quantity or relationship of the noun. A determiner is used before a noun to show which particular instance of the noun we are referring to in a sentence. Determiners can also be considered modifying words, as they specify the reference that a noun has [22].

2.3 Part of Speech tagging for English

Tagging is essentially the label given to a word in order to know which part of speech it belongs to. Tagging is a basic and very important job in natural language processing because of its high ambiguity: many words can belong to more than one part of speech. Many taggers are available online that can do the job up to a certain extent, but in this work we did the tagging with our own tagging methods and algorithms so that we know what is going on [23]. For example: She <subject pronoun> wears <verb> a <determine> shabby <adjective> dress <noun>. A <determine> juggler <noun> is <verb> a <determine> boy <noun>. The tags are shown in Table 2.3 below.

Table 2.3: Part of speech tagging for English

Parts of Speech    Tags           Examples
Conjunction        Conjunction    for, and, yet, or, because, but, nor, so, etc.
Determiners        Determine      a, an, the, this, that, ...
Noun               Noun           dog, dollar, youngster, table, etc.
Numerals           Numerals       one, 1, two, 2, five, 5, etc.
Adjective          Adjective      brilliant, angry, sweet, pleasant, hollow, etc.
Preposition        Preposition    to, in, on, after, at, by
Verb               Verb           smile, eat, catch, come, push, hit, etc.
Adverb             Adverb         beautifully, smartly, fast, suddenly, etc.


2.4 Parts of Speech in Pashto

Table 2.4: Pashto part of speech tagging

Noun      (N)    ,امریکا،ډاکټر,الوتکېهغه<PN>  یو <Det> ډاکټر<N>  <V> ده   امریکا <N> ډېر <Det> شتمن<Adj> هیواد<N> ده <V>  
Pronoun     (PN)هغه ، دوی، موږ ،وغیرہمونږ<PN> ,پارک<N> ,ته<V> روان ہوں ۔<V>  زه<PN> کرکټ<N> کړم۔<V> 
Verb (V)خوړل ، سېځل ، ګرځېدل ،وغیرہهغه <PN> پوهنتون <N> ته  <PP>زۍ<V> وقار <N> سبق <N> واې ۔<V>
    Adjective       (Adj)ښکلی ،اوږد،دروند وغیرہفريال <N> خکلی <Adj> ماشومه <N> ده ۔<V>
Adverb (Adv)تر اوسه ، لری ، په تیزی سره<Adv> پاکستان<N> ډېر  ښکلی<Adj> هيواد<N> ده۔<V> پيشو <N> ډيره <Adj> سپينا <Adj> ده ۔<V>
Conjunction (con)او, تر اوسه, يا, ځکه چې, خو, نو,جواد <N> او<con> وقار <N>  به <pp>پوهنتون <N> ته  <PP> لاړ شي  ۔<V> 

2.5 Part of Speech for Pushto

Since we used a bilingual representation of words and phrases for both languages, English and Pushto, tagging for both languages becomes easy: we do not need to tag the parts of speech of Pushto text from scratch. Instead, we only need to find the mapped sense for the particular input, as in the example below and in Table 2.5.

Two (number) and (conjunction) three (number) become (verb) five (number)

Dawa (number) aw (conjunction) daree (number) penza (number) shae (verb)

Table 2.5: Pashto parts of speech

Parts of Speech    Examples
Conjunction        او, تر اوسه, يا, ځکه چې, خو, نو
Noun               پیښور, ميز , بسته
Numerals           يو, 1, دوه 2, پنځه 5
Adjective          ښه, غوسه, خواږه
Preposition        ته, په, وروسته, ده
Verb               خندا, خوري, ټیله کول
Adverb             په زړه پورې, سمه ده, ناڅاپي
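The idea of reusing English tags for Pushto can be sketched in a few lines of Python. This is only an illustration under assumptions: the `BILINGUAL` dictionary and the `transfer_tags` function are hypothetical names, and the romanized Pushto words are taken from the "two and three become five" example above. Word reordering is deliberately not handled here.

```python
# Hypothetical bilingual lexicon built from the example sentence above
# (romanized Pushto, as written in the text).
BILINGUAL = {
    "two": "dawa",
    "and": "aw",
    "three": "daree",
    "become": "shae",
    "five": "penza",
}


def transfer_tags(tagged_english):
    """Map each (English word, tag) pair to (Pushto word, tag).

    The tag is copied unchanged. Words missing from the lexicon are kept
    as they are and marked with the tag "unknown".
    """
    result = []
    for word, tag in tagged_english:
        pushto = BILINGUAL.get(word.lower())
        if pushto is None:
            result.append((word, "unknown"))
        else:
            result.append((pushto, tag))
    return result


tagged = [("Two", "number"), ("and", "conjunction"), ("three", "number"),
          ("become", "verb"), ("five", "number")]
pushto_tagged = transfer_tags(tagged)
print(pushto_tagged)
```

The output keeps the English word order; placing the verb at the end, as in the Pushto sentence above, is left to the later swapping and retrieval steps.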

Algorithm for POS tagging

1 Take the input text from the user as the source

2 Tokenize the input text according to the defined rules

3 Search for each token of the input text in the lexicon

4 Get the tagged output

5 Extract the marked tokens from the tagged output
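The five steps above can be sketched as a small lexicon lookup tagger. This is a minimal sketch, not the system's actual implementation: the `LEXICON` contents, the function names and the "unknown" tag are assumptions made for illustration, with the tag names taken from the tagged example sentences in section 2.3.

```python
# Tiny illustrative lexicon; a real system would load this from its database.
LEXICON = {
    "she": "subject pronoun", "wears": "verb", "a": "determine",
    "shabby": "adjective", "dress": "noun", "juggler": "noun",
    "is": "verb", "boy": "noun",
}

PUNCTUATION = ".,!?"


def tokenize(text):
    # Step 2: split on white space, drop surrounding punctuation, lowercase.
    tokens = (tok.strip(PUNCTUATION).lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def pos_tag(text):
    # Steps 3 and 4: look every token up in the lexicon and attach its tag.
    return [(tok, LEXICON.get(tok, "unknown")) for tok in tokenize(text)]


def marked_tokens(tagged):
    # Step 5: keep only the tokens that were found in the lexicon.
    return [(tok, tag) for tok, tag in tagged if tag != "unknown"]


# Step 1: the input text (here a fixed string instead of user input).
tagged = pos_tag("She wears a shabby dress.")
print(tagged)
```

A pure lookup like this cannot resolve words that belong to more than one part of speech; it simply returns whatever single tag the lexicon stores.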


In natural language processing, tagging plays an important role. It is a significant necessity for putting a human language on engineering track. Before developing a tagger, a tag set is required for that specified language [24].

Wikipedia defines part-of-speech tagging as follows:

In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

Types of POS taggers

POS-tagging algorithms are divided into two groups:

  • Rule-Based POS Taggers
  • Stochastic POS Taggers

E. Brill’s tagger is one of the first and most widely used English POS taggers. It is a rule-based tagger that finds a set of tagging rules that correctly describe the data and minimize POS tagging errors.

Rule-Based POS Taggers

Rule-based approaches use contextual information to assign tags to unknown or ambiguous words.

Stochastic POS Taggers

The term ‘stochastic tagger’ can refer to any number of different approaches to the problem of POS tagging that incorporate frequency or probability.

Chapter 3    Patterns and Verifications of Phrases

In this chapter we explain the chunking of POS-tagged words and discuss the proposed regular expressions.

3.1 Chunking

Chunking is the process of extracting phrases from unstructured text according to specific rules. Syntactic chunking recognizes groups of parts of speech in a sentence. In our work, we have designed a regular expression algorithm that chunks the POS-tagged words into phrases, breaking after a comma or “and”, so our chunks represent the phrases [25]. It is advisable to treat a phrase such as “South Africa” as a single unit instead of the separate words ‘South’ and ‘Africa’.


sentence = “the little yellow dog barked at the cat”

                                     Example for chunking
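To make the chunking example concrete, the sketch below applies a regular expression to the sequence of POS tags rather than to the words themselves. It is a simplified illustration, not the patterns proposed in this work: the one-letter tag codes, the `NP_PATTERN` (an optional determiner, any number of adjectives, then a noun) and the hand-assigned tags are all assumptions.

```python
import re

# One-letter codes so that position i in the code string is token i.
TAG_CODES = {"determine": "D", "adjective": "A", "noun": "N",
             "verb": "V", "preposition": "P"}

# Assumed noun phrase pattern: optional determiner, adjectives, noun.
NP_PATTERN = re.compile(r"D?A*N")


def np_chunks(tagged):
    """Return the noun phrase chunks found in a list of (word, tag) pairs."""
    codes = "".join(TAG_CODES[tag] for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(codes):
        words = [word for word, _ in tagged[match.start():match.end()]]
        chunks.append(" ".join(words))
    return chunks


# Tags assigned by hand for the example sentence.
tagged = [("the", "determine"), ("little", "adjective"), ("yellow", "adjective"),
          ("dog", "noun"), ("barked", "verb"), ("at", "preposition"),
          ("the", "determine"), ("cat", "noun")]
print(np_chunks(tagged))  # ['the little yellow dog', 'the cat']
```

Matching on tag codes keeps the pattern independent of the vocabulary, so the same pattern applies to any sentence that has been tagged.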

3.2 Phrases

Phrases are put together to make a sentence. According to Wikipedia, “a phrase is any collection of words, or sometimes a word, which shows a specific role within the grammatical structure of sentence.” A phrase may be a collection of words, but we cannot call it a sentence because it may not give a complete sense.

3.3 Regular Expressions RE

A regular expression (RE) specifies a set of strings that match it; given a regular expression, one can check whether a particular string matches it [26]. We used regular expressions in our work to generate phrases from collections of words identified in the dictionary. The patterns are used to produce the phrases.

3.4 Deterministic Finite Automata DFA

In the theory of computation, deterministic finite automata (DFA) are used to recognize regular languages and to check whether the strings generated by regular expressions are legal [27]. We use DFAs to confirm that the patterns are valid and that they produce only the phrases that are legal for our proposed algorithm. A DFA accepts a string if, beginning from the start state, it reaches a final state after reading the whole string. A deterministic finite automaton is also known as a deterministic finite acceptor [28].
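As an illustration of how a DFA can verify a phrase pattern, the sketch below encodes a three-state automaton that accepts tag sequences of the form optional determiner (D), any number of adjectives (A), then a noun (N). The states, symbols and pattern are assumptions chosen for the example; they are not the automata designed in this work.

```python
# Transition table of the assumed DFA. Any (state, symbol) pair that is
# missing from the table leads to an implicit dead state, i.e. rejection.
TRANSITIONS = {
    ("start", "D"): "in_np",
    ("start", "A"): "in_np",
    ("start", "N"): "accept",
    ("in_np", "A"): "in_np",
    ("in_np", "N"): "accept",
}
START_STATE = "start"
FINAL_STATES = {"accept"}


def accepts(symbols):
    """Return True if the DFA ends in a final state after reading all symbols."""
    state = START_STATE
    for symbol in symbols:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # dead state: no legal transition
            return False
    return state in FINAL_STATES


print(accepts("DAAN"))  # True: determiner, two adjectives, noun
print(accepts("DA"))    # False: the phrase never reaches its noun
```

Because every state has at most one transition per symbol, the check runs in a single pass over the tag sequence.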

3.5 Noun Phrase NP

A noun phrase or nominal phrase (abbreviated NP) is an expression in which the head is a noun (or indefinite pronoun); it is composed of a noun and one or more words that modify the noun. A noun phrase can also consist of a single word, i.e. a noun. A noun phrase can be a subject. A noun phrase ends with a noun. The pieces of a noun phrase may include a noun, a possessive pronoun and an adjective [29]. A noun phrase may consist of one word (for example, the pronoun we or the plural noun dogs), or may contain a noun with a number of dependents.

3.6 Pronoun Phrase

A pronoun phrase (PnP) is built around a pronoun. It may have a conjunction at the start of the phrase and has a subjective pronoun at the end [30]. Pronoun phrase chunking involves POS tags such as conjunction and subjective pronoun.

3.7 Preposition Noun Phrase

A preposition is a word that shows a relation between a noun or pronoun and another word in a sentence [31]. A prepositional phrase is a group of words that lacks either a verb or a subject. A preposition noun phrase (PNP) takes a preposition as its start and a noun phrase as its end: the preposition comes before the noun phrase, which represents its object. The PNP contains only two parts, the preposition and the noun phrase (NP). The chunk of a preposition noun phrase is composed of noun, conjunction, pronoun, number, adjective and preposition tags.

Adjective Phrase (or adjectival phrase)

Wikipedia: “An adjective phrase is a phrase the head word of which is an adjective, e.g. fond of steak, very happy, quite upset about it, etc. The adjective can initiate the phrase, conclude the phrase, or appear in a medial position”.

It is a group of words that describes a noun or pronoun in a sentence, thus functioning as an adjective.

Adjective concludes the phrase (e.g. very happy)

Adjective initiates the phrase (e.g. fond of steak)

3.8 Verb Phrase VP

A verb phrase (VP) consists of one verb as its head together with other words as its dependents, including words indicating tense, mood, or person. A verb phrase consists of an auxiliary, or helping, verb and a main verb. Helping verbs “help” the main verb to complete a verb phrase [32]. The helping verb always precedes the main verb. In our work, we modify a verb by adding an adverb, reflexive pronoun, objective pronoun, etc. Every sentence must have a verb; a verb phrase is a verb and all of its modifiers and helpers.

 3.9 Predefined Phrase

Predefined phrases are collections of words that are already well defined in the English language and cannot be produced by the patterns, for example “by the way”. These phrases have exact meanings; if such a phrase is translated word by word it may give a different meaning, so we need to keep track of such phrases to translate them correctly.






Chapter 4    Proposed Representation and Mapping of Pashto and English Text

In this chapter the logical representation for English and Pushto is described. It also explains how the Predicate Logic (PL) of Pushto is used to create the PL for English.

The algorithms and all processes used in our work are also explained in detail in this chapter.

4.1 Predicate Logic PL

Predicate Logic (PL) is a well-known formal system of logic. Predicate logic allows us to break sentences into smaller parts; predicates have different uses and interpretations in mathematics and logic. Wikipedia: “In mathematical logic, a predicate is commonly understood to be a Boolean-valued function P: X→ {true, false}, called the predicate on X”. Predicate logic is a deductive symbolic system that lets us determine whether propositions are true or false by assigning values to variables [33]. Predicate logic is useful for NLP semantics: unlike propositional logic, it allows us to decompose sentences into smaller fragments, and it handles expressions of generalization such as the following.

a. Every dog is sleeping.

 b. Some girl likes Anmol.

 (some girl)(she likes Anmol)

 for some x, x is a girl, x likes Anmol

 true if ‘she likes Anmol’ is true for at least one possible value for ‘she’

In this way, predicate logic allows us to treat pronouns such as ‘she’ as variables.
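The two example sentences can be evaluated over a small domain to show how a quantified statement becomes true or false once values are assigned to the variable. The sketch below is illustrative only: the individuals in `DOMAIN` and the facts stored in the sets are invented for the example.

```python
# Hypothetical domain of individuals and facts about them.
DOMAIN = ["Sana", "Hina", "Rex"]
IS_GIRL = {"Sana", "Hina"}
IS_DOG = {"Rex"}
IS_SLEEPING = {"Rex", "Hina"}
LIKES = {("Hina", "Anmol")}


def every_dog_is_sleeping():
    # For every x: if x is a dog, then x is sleeping.
    return all(x in IS_SLEEPING for x in DOMAIN if x in IS_DOG)


def some_girl_likes(person):
    # For some x: x is a girl and x likes the given person.
    return any(x in IS_GIRL and (x, person) in LIKES for x in DOMAIN)


print(every_dog_is_sleeping())   # True: the only dog, Rex, is sleeping
print(some_girl_likes("Anmol"))  # True: holds for x = Hina
```

The second function mirrors the reading given above: the sentence is true if "x likes Anmol" holds for at least one girl x in the domain.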

4.2 Mapping of Pushto to English

Mapping is the translation of the words of one language into another language in such a way that the meaning remains the same. We did our mapping by taking Pushto phrases, dividing them into chunks, and then mapping the chunks to their corresponding English translations. Real translation, however, is not just mapping chunks or directly replacing Pushto words with English words one by one; it needs an algorithm.

4.3 Swapping

Swapping means interchanging two values: the value a is replaced by b and the value b by a.

Swapping is applied as in the following example.

Example:      If Pushto => (NP, PNP)

                   Then English => (PNP, NP)

So if the predicate logic in Pushto is stated as

VP0 ([PnP0], NP0, PNP0)

then in English it would be

VP0 ([PnP0], PNP0, NP0)
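The swap of an NP and a following PNP can be sketched on a simple data structure. The representation below, a verb phrase label plus a list of argument labels, and the function names are assumptions made for illustration; the labels follow the VP0, PnP0, NP0 and PNP0 notation used above.

```python
def phrase_type(label):
    """Strip the numeric index from a label, e.g. 'PNP0' -> 'PNP'."""
    return label.rstrip("0123456789")


def swap_np_pnp(args):
    """Return a copy of args in which each NP directly followed by a PNP
    has exchanged places with that PNP."""
    out = list(args)
    i = 0
    while i < len(out) - 1:
        if phrase_type(out[i]) == "NP" and phrase_type(out[i + 1]) == "PNP":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return out


# Predicate logic of the Pushto sentence as (predicate, arguments).
pushto_pl = ("VP0", ["PnP0", "NP0", "PNP0"])
english_pl = (pushto_pl[0], swap_np_pnp(pushto_pl[1]))
print(english_pl)  # ('VP0', ['PnP0', 'PNP0', 'NP0'])
```

The comparison is case sensitive, so the pronoun phrase label PnP is never confused with the preposition noun phrase label PNP.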

4.4 Pushto Language representation in PL

Like Pushto, the English language is represented in predicate logic, but unlike Pushto it is retrieved from right to left. All the arguments of the predicate are retrieved from left to right, one after the next [23]. ‘Argument’ here means that the first chunk is retrieved first.

4.4.1 Mapping of chunks to English

The chunks that are created by the tagger for Pushto are mapped to their respective English translations, which are already defined in our domain.

4.5 Algorithms for proposed work

The algorithms and dictionaries proposed in this work are discussed below.

4.5.1 Algorithm for English Logical representation

  1. Tokenize the input text and tag each token with its corresponding POS.
  2. Produce phrases from the tagged tokens, breaking at “comma”, “and” or “.”.
  3. Break the phrases produced in step 2 into chunks.
  4. Take every chunk, one word at a time, and replace it with its respective Pushto translation.
  5. Apply the rules for forming the English language to that tag set.
  6. Arrange the tags and produce the result.

4.5.4 Algorithm for Retrieval of English Text

  1. Read the data from the source and add it to a list.
  2. Identify the phrases in the given sentences.
  3. Apply the predicate logic representation according to the given rules.
  4. Retrieve every phrase, as divided by “comma”, “break” or “and”, in right-to-left order.
  5. Divide the phrases into chunks.

4.5.5 Algorithm for Retrieval of English Text

  1. Read the data from the text produced by the English algorithm.
  2. Swap the values in the phrases of the given sentence according to the defined rules.
  3. Apply the predicate logic to every phrase.
  4. Retrieve the sentences in left-to-right order.


Chapter 5    System Translation

This chapter describes the translation performed by our system, and how the system works from taking input from a source (a user or a file) up to producing a complete translation.

5.1 Taking input from a user

A Pushto sentence is taken as input from a user or a file. The words of the sentence should be present in the data set, i.e. the database domain: the input is a string of words that are already defined in our domain. ‘Domain’ here means the database from which we tag the sentence with POS, as shown in Figure 5.1 below.

                          Figure 5.1 input from user

5.2 Searching for predefined phrases (PDP)

If any PDP exists in the sentence, a label, i.e. PDP, is assigned to it.

  1. If no PDP is present in the input string, the original input text is returned.
  2. If the complete phrase consists of only one word, its corresponding value is shown without swapping or applying any rules to it.
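The search for predefined phrases can be sketched as a lookup against a list of known phrases. The following is a minimal sketch under assumptions: `PREDEFINED_PHRASES` holds only the single example named in this thesis ("by the way"), and the function name and the "TEXT" label for the remaining segments are hypothetical.

```python
import re

# The only predefined phrase named in the text; a real list would be longer.
PREDEFINED_PHRASES = ["by the way"]


def label_predefined(text, phrases=PREDEFINED_PHRASES):
    """Split text into segments and label each one 'PDP' or 'TEXT'."""
    # Longest phrases first, so that a longer PDP wins over a shorter one.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in ordered) + r")\b",
        re.IGNORECASE,
    )
    segments = []
    # Because the pattern has a capturing group, split() keeps the matches.
    for piece in pattern.split(text):
        piece = piece.strip(" ,")
        if not piece:
            continue
        label = "PDP" if pattern.fullmatch(piece) else "TEXT"
        segments.append((piece, label))
    return segments


print(label_predefined("By the way, he goes to school"))
```

When no predefined phrase is present, the function returns the whole input as a single "TEXT" segment, which corresponds to returning the original input text unchanged.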

5.3 Tokenization

Tokenization is the breaking up of a string into pieces such as words, phrases and other elements, called tokens; it divides text into words and sentences. Tokenization is the first step applied in our translation, because the text needs to be segmented into words, phrases, etc. to make it ready for our algorithm [35]. Tokenization is performed on the basis of the white spaces in the input text. Leading and trailing spaces are removed from the text, and any double space in a sentence is replaced by a single space, to minimize processing time.

sohail goes to school

سهيل ښوونځي ته ځي او

noun/pronoun verb preposition noun

they drink water

دوی اوبه څښل

noun/pronoun verb noun/pronoun

The following are example sentences for tokenization: “I love ice cream. I also like steak.”
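The tokenization steps described in this section (trimming leading and trailing spaces, collapsing repeated spaces, then splitting on white space) can be sketched as follows. The function names are hypothetical, and the sentence splitter is a deliberately naive assumption that treats ".", "!" and "?" followed by a space as a sentence boundary.

```python
import re

PUNCTUATION = ".,!?"


def normalize_spaces(text):
    # Remove leading/trailing spaces and collapse runs of white space.
    return re.sub(r"\s+", " ", text.strip())


def split_sentences(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by a space.
    parts = re.split(r"(?<=[.!?])\s+", normalize_spaces(text))
    return [part for part in parts if part]


def tokenize(sentence):
    # Split on single spaces and drop punctuation around each token.
    tokens = (tok.strip(PUNCTUATION) for tok in normalize_spaces(sentence).split(" "))
    return [tok for tok in tokens if tok]


text = "  I love ice cream.  I also like steak. "
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Normalizing the spaces first means the later split never produces empty tokens from doubled spaces.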

5.4 Chunking

Chunking is the process of taking individual pieces of information (chunks) and grouping them into larger units according to defined patterns. During the chunking phase of the algorithm, regular expression rules or pattern rules are applied to produce the phrases from the tagged text.

(he, هغه)(write,ليکئ) (a,یو)(letter, خط )

he write a letter هغه یو خط ليکئ

noun/pronoun verb determiner noun

5.5 PL representation for English language

“Representation is the depiction of a person, thing or idea in written, visual, performed or spoken language”

In predicate logic, sentences can be divided into words, e.g. verbs, nouns and adjectives, or even phrases. As there is a finite number of words in a language, the words can be stored to represent knowledge. According to the knowledge representation technique proposed above, we have represented our Pushto text in predicate logic in order to check the correctness of retrieval. The accuracy of the translation depends on the accuracy of retrieval of the PL of the given text.

5.6 Swapping specific phrases

Swapping is the most important part of our algorithm and is already described in details in above chapter. For the retrieval of accurate English sentence from input text (user,file), we need to have correct PL for pushto sentence. Swapping has the following two rules which are applied before the making of PL for English.

  1. If the PL of the Pushto sentence has an NP before a PNP, the two exchange (swap) places during creation of the PL representation for English.
    1. Example: if NPx, PNPx then PNPx, NPx
  2. If there is a PVP before an NP, the two are swapped as well.
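The two swapping rules can be sketched as a single pass over the labelled phrases. The sketch assumes that "before" means "immediately before" and that a phrase which has just been swapped is not examined again; both are assumptions, and `swap_phrases` is a hypothetical name.

```python
# Adjacent label pairs that exchange places: rule 1 (NP, PNP) and rule 2 (PVP, NP).
SWAP_PAIRS = {("NP", "PNP"), ("PVP", "NP")}


def swap_phrases(phrases):
    """Return a copy of the (label, phrase) list with the swapping rules applied once."""
    result = list(phrases)
    i = 0
    while i < len(result) - 1:
        if (result[i][0], result[i + 1][0]) in SWAP_PAIRS:
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return result


print(swap_phrases([("NP", "NPx"), ("PNP", "PNPx")]))
```

If no rule matches, the list is returned unchanged, so the two PL representations share the same argument order.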

5.7 PL representation for Pushto language

The PL for English is generated from the PL of Pushto after it has passed through swapping. If the Pushto PL has neither an NP before a PNP nor a PVP before an NP, then the PL for English is the same as the PL for Pushto [36]. During this step, the corresponding meaning of each phrase is retrieved from the bilingual corpus. The English meaning of each phrase is written in place of the Pushto phrase in the PL.

Figure 5.3: Predicate Logic representation by our system

5.8 Translation

The arguments or phrases of the PL are retrieved from right to left by fetching their corresponding meanings from the database. The difference between English and Pushto is that in English the VP of the sentence would be retrieved once all the arguments of the function are retrieved. This retrieval gives a Pushto translation for the input text given in step one.

Figure 5.8: Predicate Logic and original translation of Pashto
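The retrieval step can be illustrated with the word pairs from the chunking example. This is a sketch under stated assumptions: `LEXICON` is a hypothetical in-memory stand-in for the bilingual database, the arguments are emitted in their stored order, and the verb phrase is emitted last.

```python
# Hypothetical stand-in for the bilingual database, filled with the word
# pairs from the chunking example.
LEXICON = {"he": "هغه", "write": "ليکئ", "a": "یو", "letter": "خط"}


def translate_phrase(phrase, lexicon):
    """Look up every word of a phrase; words outside the corpus raise KeyError."""
    words = []
    for word in phrase.split():
        if word not in lexicon:
            raise KeyError(f"no entry for {word!r} in the bilingual corpus")
        words.append(lexicon[word])
    return " ".join(words)


def translate(functor, args, lexicon=LEXICON):
    """Emit the translated arguments first and the translated verb phrase last."""
    parts = [translate_phrase(arg, lexicon) for arg in args]
    parts.append(translate_phrase(functor, lexicon))
    return " ".join(parts)


print(translate("write", ["he", "a letter"]))
```

For the predicate write(he, a letter), this yields the words in the same order as the Pushto example sentence of the chunking section.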

Chapter 6 Conclusion

6.1 Conclusion

The work done in this thesis draws upon a tagged text and a bilingual corpus of phrases. After tokenizing and tagging the sentences, the phrases are created automatically using our proposed patterns/regular expressions described in Chapter 3. The sentence is represented in English predicate logic. This work has also introduced swapping before the representation of the PL for Pushto, as mentioned in section 4.4. Finally, after mapping the phrases of English into Pushto, the PL for Pushto is generated. This work has also proposed that retrieval of Pushto text from the PL should be done by retrieving the phrases from left to right, as mentioned in section 4.5. The algorithm of this work is implemented using web technologies as a working web application, which shows the results mentioned in section 5.9. The main goal of this work was to increase the effectiveness of an existing algorithm which was suggested by our supervisor Dr. Amjad Ali.

6.2 Future work

As the system is designed as an RBMT system, it has a few disadvantages. The system is designed for a limited corpus, so it will not work on words outside the given corpus. If the retrieval of the English PL for an input text is accurate, then it will generate an accurate translation, but if the retrieval of the PL is not 100% accurate, then the translation will be confusing. Using morphological analysis of words, instead of mapping more phrases, would save space and time.


[1]L. Sweeney, “That’s AI?: A History and Critique of the Field,” School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3890, pp. 19-23, July 2003.
[2]R. Davis, H. Shrobe and P. Szolovits, “What is Knowledge Representation?,” AI Magazine, pp. 8-15, 1993.
[3]R. Weischedel, J. Carbonell, B. Grosz, W. Lehnert, M. Marcus, R. Perrault and R. Wilensky, “White Paper on Natural Language Processing,” in Proc. DARPA Speech and Natural Language Workshop, Harwich Port, Massachusetts, October 1989.
[4]F. Zanettin, “Bilingual Comparable Corpora and the Training of Translators,” vol. 43(4), pp. 616-630, December 1998.
[5]N. Mandelblit, “Machine Translation: A Cognitive Linguistics Approach,” in Proceedings of the 5th Int. Conf. on Theoretical and Methodological Issues in Machine, Kyoto, Japan, 1993.
[6]N. Chomsky, Knowledge of Language: Its Nature, Origin, and Use, USA: Praeger Publishers, 521 Fifth Avenue, New York, NY 10175, 1986.
[7]B. J. Dorr, L. S. Levin and E. H. Hovy, “Machine translation: interlingual methods,” in Brown K (eds) Encyclopedia of language and linguistics, Elsevier, Oxford, UK, 2004.
[8]B. Vauquois, “A Survey of Formal Grammars and Algorithms for Recognition and Transformation in Machine Translation,” in Proceedings of the IFIP Congress-6, Edinburgh, 1968, pp. 254-260.
[9]K. Imamura, H. Okuma, T. Watanabe and E. Sumita, “Example-based Machine Translation Based on Syntactic Transfer with Statistical Models,” in Proceedings of COLING, Geneva, Switzerland, pp.99-105, 2004.
[10]A. Lavie, K. Probst, E. Peterson, S. Vogel, L. Levin, A. Font-Llitjos and J. Carbonell, “A Trainable Transfer-based Machine Translation Approach for Languages with Limited Resources,” in Proceedings of Workshop of the European Association for Machine Translation (EAMT-2004), Valletta, Malta, 2004.
[11]S. Tripathi and J. K. Sarkhel, “Approaches to machine translation,” Annals of Library and Information Studies, vol. 57, pp. 388-393, December 2006.
[12]O.-W. Kwon, S.-K. Choi, K.-Y. Lee, Y.-H. Roh and Y.-G. Kim, “English-Korean Patent Translation System: FromTo-EK/PAT,” in MT Summit XI Workshop on Patent Translation Program, Copenhagen, Denmark, September 11, 2007.
[13]Omniscien Technologies, [Online]. Available: [Accessed 10 September 2018].
[14]D. Turcato and F. Popowich, “What is Example-Based Machine Translation?,” in Carl and Way (2003), 2003, pp. 59-81.
[15]J. Ma, “Automata in Natural Language Processing,” Technical Report 0834, Laboratoire de Recherche et Developpement, l’Epita, France, December 2008.
[16]R. Emmanuel and S. Yves, “Introduction to Finite-State Devices in Natural Language Processing,” 1 January 1996.
[17]G. Jirásková and M. Palmovský, “Kleene Closure and State Complexity,” ITAT 2013 Proceedings, CEUR Workshop Proceedings, vol. 1003, pp. 94-100, 2013.
[18]P. Schachter and T. Shopen, “Parts of speech systems,” in Language Typology and Syntactic Description: Clause Structure, vol. 1, Cambridge University Press, October 2007.
[19]M. Stavrou and A. Terzi, “Types of Numerical Nouns,” in Proceedings of the 26th West Coast conference on formal linguistics. eds. Charles B. Chang and Hannah J. Haynie, Somerville, MA: Cascadilla, 2008.
[20]G. Kleiser, “What is a verb?,” in Exploring English Grammar, New Delhi, APH Publishing Corporation, 2008, p. 2.
[21]J. Walter, “Building Writing Skills the Hands-on Way,” in Building Writing Skills the Hands-on Way, Boston, MA 02210 USA, Cengage Learning, 2016, pp. 165-171.
[22]D. Veselka, English Articles and Determiners, Independently published, 2017.
[23]Evaluation Report on the Primary Pashto Text-Books Translation Project Peshawar: Education Department, Govt. of the N.W.F.P. Ewald, H .1839.  Utmanzai, Charsadda. Guide .1990. Da Primary Ustazano Rahnuma Guide: Pakhto [Pashto: Primary School Teachers’ Guide] Peshawar: Primary Text Book Translation Project
[24]Abrar, Sayedul .1979. ‘An Appraisal of the Work of Pashto Academy at Peshawar’, Journal of Central Asia Vol. II; No. 1 (July), 89-106.
[25]S. Saxena, R. Raperya and N. K. Malik, “MACHINE LEARNING USING CHUNKING,” International Journal of Advance Research in Science and Engineering, vol. 6, no. 02, pp. 285-292, February 2017.
[26]M. Hamada and S. Sato, “A Game-based Learning System for Theory of Computation Using Lego NXT Robot,” in International Conference on Computational Science, ICCS 2011, Procedia Computer Science 4 (2011), 2011, pp. 1944-1952.
[27]D. Ather, R. Singh and V. Katiyar, “Simplifying Designing Techniques: To Design DFA that Accept Strings over ∑= {a, b} having at least x Number of a and y Number of b,” International Journal of Computer Applications (0975 – 8887), vol. 91, no. 07, pp. 12-17, April 2014.
[28]S. Wason, S. Rathi and P. Kumar, “RESEARCH PAPER ON AUTOMATA,” 2014 IJIRT, ISSN: 2349-6002, vol. 1, no. 5, pp. 507-510, 2014.
[29]W. M. Soon, H. T. Ng and D. C. Y. Lim, “A Machine Learning Approach to Coreference Resolution of Noun Phrases,” Association for Computational Linguistics, vol. 27, no. 4, pp. 521-543, 2001.
[30]C. D. Manning and H. Schütze, “Foundations of Statistical Natural Language Processing,” Cambridge, MA: The MIT Press, 1999, vol. 26, no. 2, 1999.
[31]C. L. Vitto, “Verb Phrase,” in Grammar by Diagram Second Edition, Broadview Press, November 2008, pp. 26-30.
[32]H. Broekhuis, H. Broekhuis and R. Vos, “PP-complements (prepositional objects),” in Syntax of Dutch Verbs and Phrases Volume 1, Amsterdam, Amsterdam University Press, 2015, pp. 284-321.
[33]K. H. Rosen, “Predicates and Quantifiers,” in Discrete Mathematics and its Applications Seventh Edition, 1221Avenue of the Americas, NewYork, McGraw-Hill, a business unit of The McGraw-Hill Companies, Inc., 2012, pp. 37-40.
[34]P. Bakliwal, D. V V and C. V. Jawahar, “Align Me : A framework to generate Parallel Corpus Using OCRs & Bilingual Dictionaries,” in Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing, pages 183–187, Osaka, Japan, December 2016.
[35]J. J. Webster and C. Kit, “Tokenization as the Initial Phase in NLP,” in Proc. of COLING-92, Nantes, pp. 1106-1110, Aug. 23-28, 1992.
[36]A. Amjad and M. A. Khan, “Selecting Predicate Logic for Knowledge Representation by Comparative Study of Knowledge Representation Schemes,” in International Conference on Emerging Technologies, IEEE, pp.23-28, 2009.
