Machine Translation from English to Urdu using Predicate Logic

Abstract

Machine translation (MT) is the automatic translation of text from one natural language (the source) into another (the target) by computer. MT is a subfield of natural language processing. Because natural languages are ambiguous, the goal is not necessarily a grammatically perfect translation, but one that at least conveys the same sense as the source text. This work uses a rule-based system, i.e. Rule Based Machine Translation, together with a phrase-based bilingual dictionary of English and Urdu. The system takes English input text and first looks for predefined phrases. The words are then tagged with their parts of speech, and the tagged words are combined into phrases according to certain rules. These English phrases are represented in predicate logic and later mapped to Urdu phrases, from which the Urdu translation is retrieved. The resulting algorithm for English to Urdu translation is implemented in a web application and has shown % accuracy.

Keywords

English to Urdu Translation, Phrase based translation, Chunking using Regular expressions, Predicate logic for English and Urdu, Knowledge representation

1. Introduction

MT is the automated translation of text from a source language to a target language. It is not merely the word-by-word substitution of source words with target words from a dictionary; rather, it translates the source text into an equivalent target text by applying linguistic rules, and therefore requires expertise in the grammar of both the source and target languages [1]. Machine translation methodologies are divided into three categories, i.e. direct, transfer, and interlingual, according to how the source language is translated into the target language [2]. Direct translation is word-to-word translation using a dictionary. It involves morphological analysis for looking up words in the dictionary, but does not require a linguistic representation or the determination of grammatical structure. The transfer methodology comprises syntactic and semantic transfer between the source and target languages. During syntactic transfer, the sentences are parsed, each word is assigned its part of speech, and the source language nodes (words or phrases) are mapped to target language nodes using transfer rules [3]. Semantic transfer represents the source language in a knowledge representation scheme, which consists of a series of structures having a meaning in the target language [4]. It needs monolingual dictionaries of the two languages. Interlingua-based machine translation relies on a representation of the source language from which an equivalent translation in the target language is produced.

There are different approaches to MT, including rule based machine translation (RBMT), dictionary based machine translation, statistical machine translation (SMT), and example based machine translation (EBMT). RBMT translates a source language into a target language by applying morphological, syntactic and semantic analysis. It applies linguistic rules to a large amount of data in three phases, i.e. analysis, transfer and generation [5]. Dictionary based machine translation translates words into other words by looking up entries in bilingual dictionaries. In SMT, analysis of bilingual text corpora (source and target languages) and monolingual corpora (target language) generates statistical models that transform text from one language to another, with statistical weights used to decide the most likely translation [6]. An EBMT system uses a bilingual corpus with a parallel, point-to-point mapping of words at run-time [7].

Automata theory is the study of abstract machines that operate automatically and are able to solve computational problems. It uses mathematical models to perform automatic operations. Automata are the backbone of modern computing and have many applications, such as computer languages, compilers, regular expressions, artificial intelligence and hardware design. Automata theory is very important in Natural Language Processing (NLP) for the representation of large scale dictionaries and the indexation of natural language text [8]. Finite state automata are used in many domains, such as pattern matching, pattern recognition and handwriting recognition, to name a few [9].

2. Background/literature review

Machine translation basically comprises three parts, i.e. a lexicon, grammar rules, and an interpretation program [10]. The problem is how to separate the grammar rules from the interpretation program so that the system becomes language independent. To overcome this problem, researchers have focused on knowledge representation in NLP. It is important to use a knowledge representation approach that enables computers to use knowledge easily and efficiently. Predicate logic is one of the best knowledge representation schemes in this respect, representing knowledge easily, efficiently and accurately compared to other knowledge representation techniques [11]. This scheme has the capability to support an efficient knowledge understanding and generation system for machine translation from English text to Urdu text.

Predicate logic uses the concepts of function, predicate, variable, constant, logical connective and quantifier to represent the facts of a natural language [12]. It is expressive enough to represent knowledge [8]. It is a standard knowledge representation scheme for developing expert systems [9]. It represents knowledge in finer detail [10]. The input text can be split into words or even phrases, and these words or phrases are then represented in predicate logic. Knowledge comprises facts, ideas, theories, procedures and relationships. It is also data that has been organized and analyzed to make it understandable and applicable to problem solving or decision making [6]. Knowledge representation is a subarea of artificial intelligence concerned with understanding, designing and implementing methods for representing information in computers, and with inferring new information from the represented information [11]. It plays a vital role in computational linguistics. A good knowledge representation is expressive, concise and unambiguous [12]. In a computer, knowledge should be represented effectively and precisely for processing and retrieval. Knowledge representation is used in core areas of computational linguistics such as question answering, machine translation, information retrieval, and information extraction. The fundamental issue in knowledge representation is the development of precise notations for representing knowledge. Such notations are used in knowledge representation techniques [13].

Chunking is the procedure of dividing a sentence into constituents, or phrases, by means of parts of speech. In chunking, the key phrases or constituents of the sentence, such as the noun phrase and verb phrase, are identified, each of which carries a certain sense [14]. It is used for information retrieval, information extraction, text summarization, bilingual alignment and other computational linguistics tasks [15]. According to Abney [16], English sentences are divided and then read chunk by chunk. He presented a novel approach to parsing, called partial parsing, in which English sentences are parsed into chunks/phrases carrying proper tags [16]. Representing knowledge in phrases/chunks is very efficient, as it provides precision and accuracy [17, 18]. Word sequences, i.e. phrases, rather than single words are used effectively in MT. Phrase level representation provides more robustness in word selection and in the local reordering of words. In English-Danish machine translation, a significant improvement has been achieved using this approach [19]. The source language text is represented in phrases, these phrases are converted into phrases of the target language, and the target language phrases are then reordered. Phrase based representation and translation is efficient in developing Urdu-to-English, Arabic-to-English, and Chinese-to-English machine translation systems [20]. In phrase based translation, certain rules are used to divide the sentences of the source text into phrases, which are then used to represent the whole text of the source language.

In the phrase level approach, all of the source language text is represented without omitting any word, so the whole target language text can be retrieved easily and accurately from the representation. Accuracy is the main advantage of the phrase based approach. In our research work, we have likewise identified various types of phrases/chunks using linguistic rules and represented them in predicate logic for machine translation from English text into Urdu text. In the light of the literature review, predicate logic is used as the knowledge representation scheme for the machine translation of English text into Urdu text.

3. English and Urdu Parts of Speech tags for proposed work

Tagging assigns to each word a label that identifies the part of speech to which it belongs. It is one of the most important tasks in natural language processing, and a difficult one because of ambiguity: for many words the part of speech is uncertain. Many taggers are available online that can do the job to some extent, but in this work we did the tagging with our own tagging methods and algorithms. For example: She <subj_pron> wears <verb> a <det> shabby <adj> dress <noun>. A <det> juggler <noun> is <verb> a <det> boy <noun>.
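The paper does not describe its tagging method in detail, so the following is only a minimal sketch of one simple possibility: a dictionary lookup over a hand-built lexicon. The names `LEXICON` and `tag_sentence`, the fallback tag `unk`, and the choice of lexicon entries are assumptions made for illustration.

```python
import re

# Hypothetical lexicon built from example words of the English tag set;
# the paper's own tagging algorithm is not specified.
LEXICON = {
    "she": "subj_pron", "he": "subj_pron", "we": "subj_pron",
    "him": "obj_pron", "a": "det", "the": "det",
    "shabby": "adj", "common": "adj", "cool": "adj",
    "dress": "noun", "juggler": "noun", "boy": "noun", "person": "noun",
    "wears": "verb", "is": "verb", "has": "verb",
}

def tag_sentence(sentence):
    """Return (word, tag) pairs; words missing from the lexicon get 'unk'."""
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    return [(w, LEXICON.get(w, "unk")) for w in words]

print(tag_sentence("She wears a shabby dress."))
```

A pure lookup like this cannot resolve words that belong to more than one part of speech, which is exactly the ambiguity mentioned above; a real tagger would need context rules on top of it.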

Table 1: Parts of speech tag set for English

| Parts of Speech | Tag | Examples |
|---|---|---|
| Conjunction | conj | and, because, but |
| Determiner | det | a, an, the, this, that, every, all |
| Noun | noun | person, feats, juggler, figure, boy, Pakistan, flute etc. |
| Possessive Pronoun | poss_pron | his, your |
| Subjective Pronoun | subj_pron | he, she, it, they, you, we |
| Objective Pronoun | obj_pron | him, her, us |
| Reflexive Pronoun | refl_pron | myself, himself |
| Negation | neg | no |
| Numeral | num | one, 1 |
| Adjective | adj | wonderful, beautiful, red, nice, empty, shabby etc. |
| Preposition | pp | to, in, on, after, at, by |
| Verb | verb | amuses, see, pleased, goes, going, wears etc. |
| Adverb | adv | before, quickly, follow, can etc. |

The parts of speech (POS) defined above for English largely carry over to Urdu. However, Urdu has no prepositions; it uses postpositions instead [10]. The parts of speech for Urdu used in our work are given in Table 2 [11].

Table 2: Parts of speech for Urdu

| POS | Identifying words or markers | Example |
|---|---|---|
| Noun (N) | لڑکا، گاڑی، سوات، وغیرہ | وہ<PN> ایک<Adj> لڑکا<N> ہے<V>۔ سوات<N> بہت<Adv> خوبصورت<Adj> جگہ<N> ہے۔<V> |
| Pronoun (PN) | وہ، میں، تم، وغیرہ | وہ<PN> سکول<N> جارہا<V> ہے۔<V> میں<PN> پڑھتا<V> ہوں۔<V> |
| Verb (V) | تا ہے، تی ہے، تھا، تھی، تھے، رہا، رہی، وغیرہ | وہ<PN> کھانا<N> کھاتا<V> ہے۔<V> بارش<N> ہورہی<V> ہے۔<V> |
| Adjective (Adj) | خوبصورت، سرخ، وغیرہ | اس<PN> نے<PP> سرخ<Adj> چادر<N> اوڑھی<V> ہے۔<V> |
| Adverb (Adv) | تقریبا، اہستہ، کیوں، یوں، کیسے، ضرور | احمد<N> نے<PP> آہستہ<Adv> دروازہ<N> کھولا<V>۔ |
| Postposition (PP) | کا، کو، نے، سے، کی، کے، پر، وغیرہ | یہ<PN> جواد<N> کا<PP> گھر<N> ہے۔<V> واجب<N> اس<PN> گاڑی<N> سے<PP> اترا۔<V> |
| Conjunction (con) | یا، اور، کیونکہ، کہ | واجب<N> اور<con> جواد<N> بازار<N> جا<V>چکے ہیں۔<V> |

Because we use a bilingual mapping of words and phrases from English to Urdu, tagging for the two languages is simplified: we do not need to work out the parts of speech of the Urdu text from scratch, but only need to find the mapped Urdu equivalent of the given input. The words given in the examples below are tagged by using Table 2.

4. Proposed Regular Expressions for Phrases retrieval

Chunking is the process of taking tagged words and grouping them into larger units called phrases [12]. In linguistic terms, chunking identifies the parts of speech in a sentence and then builds higher grammatical units from them, such as noun groups, verb groups or phrases. In this work, we have designed regular expressions which chunk the POS tagged words into phrases. Each phrase is then represented by its specific symbol, such as NP, VP or PNP, which are explained in detail later in this section. For example, the sentence “Jawad and Wajib are studying at the University of Swat” is chunked into phrases as [NP0 Jawad and Wajib NP0] [VP0 are studying VP0] [PNP0 at the University PNP0] [PNP1 of Swat PNP1]. The chunking of each phrase type from its POS tags is explained individually below. The number at the end of a phrase symbol counts the occurrences of that phrase type in the sentence, starting from 0, so the highest number plus 1 gives the number of phrases of that type encountered. Here NP0, VP0 and PNP0 indicate the first noun phrase, verb phrase and prepositional noun phrase respectively, while PNP1 represents the second prepositional noun phrase of the sentence.

A phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. A phrase is not a sentence, because on its own it may not give a complete meaning and is not a grammatically complete sentence; phrases are grouped together to make a sentence. Seven types of phrases are used in this work for English text: Noun Phrase (NP), Pronoun Phrase (PnP), Prepositional Noun Phrase (PNP), Adjective Phrase (AP), Verb Phrase (VP), Prepositional Verb Phrase (PVP), and Predefined Phrase (PDP).

We used regular expressions (REs) in our work to generate phrases from groups of tagged words. A regular expression is an algebraic formula whose value is a pattern consisting of a set of strings, called the language of the expression [13]. The alphabet, represented by “Σ”, is a set of POS tags from Table 1, and the RE patterns generate the valid phrases from the tagged words of the source text. Four operators are used in our REs: “+” (union), “.” (concatenation), superscript “⁺” (positive closure, i.e. one or more repetitions) and “*” (Kleene closure, i.e. zero or more repetitions). For example, for the alphabet Σ = {a, b}, the pattern RE = (b*.a + a⁺.b)⁺ generates the infinite language L = {a, ba, bba, ab, aab, aaab, …}.
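As a quick check of the example expression, it can be rewritten in Python's `re` syntax, where union is written `|`, concatenation is juxtaposition, and the closures are postfix `+` and `*`. This is only an illustrative sketch; the names `PATTERN` and `in_language` are not from the paper.

```python
import re

# The example expression over {a, b}: one or more repetitions of
# either "zero or more b followed by a" or "one or more a followed by b".
PATTERN = re.compile(r"(?:b*a|a+b)+")

def in_language(s):
    """True if the whole string s belongs to the language of the expression."""
    return PATTERN.fullmatch(s) is not None

for s in ["a", "ba", "bba", "ab", "aab", "aaab", "", "b", "bb"]:
    print(repr(s), in_language(s))
```

The first six strings are the ones listed in L and are accepted; the empty string and strings consisting only of b are rejected, because every repetition must contain at least one a.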

As running examples, we take the five sentences below and extract their phrases using our proposed regular expressions.

  1. We are pleased to see him.
  2. A cool boy play cricket.
  3. A juggler is a common person.
  4. She wears a shabby dress.
  5. He has a flute with him.

A deterministic finite automaton (DFA) is used to recognize regular languages and to check whether a regular expression produces a valid string [14]. In our work, DFAs are used to verify that the patterns are valid and that they generate only the phrases which are valid for our proposed algorithm. A DFA accepts a phrase if, beginning in its start state, it is in a final state once the entire string has been consumed [15].

A noun phrase (NP) is a phrase in which the head or nucleus is a noun. In our work, the noun phrase is composed of a noun alone or a noun together with one or more other words. A noun phrase can be a subject or a direct/indirect object [16]. A noun phrase chunk may contain a conjunction, determiner, possessive pronoun, negation, number, adjective and noun. The regular expression and DFA for the noun phrase are explained below. For the RE, consider the alphabet ∑ = {conj, det, poss_pron, neg, num, adj, noun}.

The Pattern/RE for NP is: RE = (conj+det+poss_pron+neg+num)*(adj)*(noun)*
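One way to apply such a pattern to tag sequences, sketched here under assumptions, is to encode each tag followed by a space and match the encoded string with Python's `re` module. The encoding, the names `NP_RE` and `is_noun_phrase`, and the extra non-empty check are illustrative choices, not the paper's implementation.

```python
import re

# The NP pattern over tag names. Every tag is followed by one space in the
# encoded string so that adjacent tags cannot run together.
NP_RE = re.compile(r"(?:(?:conj|det|poss_pron|neg|num) )*(?:adj )*(?:noun )*")

def is_noun_phrase(tags):
    """True if the tag sequence matches the NP pattern.

    The pattern as written also matches the empty sequence, so a non-empty
    check is added here (an assumption of this sketch).
    """
    encoded = "".join(tag + " " for tag in tags)
    return bool(tags) and NP_RE.fullmatch(encoded) is not None

print(is_noun_phrase(["det", "adj", "noun"]))  # tags of "a shabby dress"
print(is_noun_phrase(["det", "noun"]))         # tags of "a juggler"
print(is_noun_phrase(["noun", "det"]))         # tags in the wrong order
```

Because the final `(noun)*` is optional in the pattern, a sequence such as a lone determiner also matches; a stricter variant could require at least one noun.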

For example, in the sentences above, “a cool boy”, “cricket”, “a juggler”, “a common person”, “a shabby dress” and “a flute” are noun phrases. The DFA used to validate the proposed regular expression is shown in Figure 1 below.

Figure 1: DFA for noun phrase NP

In a pronoun phrase (PnP), the head of the phrase is a pronoun. A PnP has an optional conjunction at the start of the phrase and a subjective pronoun at the end. The pronoun phrase chunk therefore consists of the POS conjunction and subjective pronoun. The subjective pronouns are listed in Table 1. For example, the words “and he” form the chunk PnP0 (and he). For the RE of PnP the alphabet is: ∑ = {conj, subj_pron}

The proposed RE/pattern for pronoun phrase is: RE = (conj)*(subj_pron)

For example, in the sentences above, “we”, “she” and “he” are pronoun phrases. The DFA used to validate the RE for PnP is shown in Figure 2.

Figure 2: DFA for Pronoun Phrase PnP
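The figure itself is not reproduced here, so the following is a sketch of a DFA derived directly from the RE (conj)*(subj_pron); the DFA drawn in Figure 2 may use different state names or layout. The state names `q0` and `q1` and the function `accepts` are assumptions for illustration.

```python
# Transition table derived from (conj)*(subj_pron).
# "q0" is the start state and "q1" the only accepting state.
# A missing transition means the input is rejected.
TRANSITIONS = {
    ("q0", "conj"): "q0",
    ("q0", "subj_pron"): "q1",
}
START = "q0"
FINAL = {"q1"}

def accepts(tags):
    """Run the DFA over a tag sequence; accept only if it ends in a final state."""
    state = START
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:  # no transition defined: reject
            return False
    return state in FINAL

print(accepts(["conj", "subj_pron"]))  # tags of "and he"
print(accepts(["conj"]))               # conjunction without a pronoun
```

The same table-driven loop works for the other phrase types; only the transition table and the set of final states change.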

A prepositional noun phrase (PNP) has a preposition as its head and a noun phrase as its complement [17]. In a PNP, the preposition comes before the noun phrase, which represents its object. The PNP consists of only two parts, the preposition and the noun phrase (NP). The chunking of a PNP involves the preposition, conjunction, determiner, possessive pronoun, negation, number, adjective, noun and objective pronoun; these parts of speech form the alphabet for PNP: Σ = {conj, det, poss_pron, neg, num, adj, noun, pp, obj_pron}. The proposed RE for PNP is: RE = (pp).(NP) + (pp).(obj_pron)

For example, in the sentences above, “with him” is a prepositional noun phrase, i.e. with[pp] him[obj_pron], so here PNP = pp.obj_pron. The DFA shown in Figure 3 is used to verify the proposed RE for the prepositional noun phrase.

Figure 3: DFA for Prepositional Noun Phrase

An adjective phrase (AP) has an adjective as the head word of the phrase. An adjective phrase chunk is composed of one or more adjectives and, optionally, conjunctions. It must contain at least one adjective; a conjunction is optional and can appear only after an adjective has been generated. The parts of speech used in the adjective phrase form the alphabet Σ = {adj, conj}. The proposed RE for AP is: RE = (adj)(conj+adj)*

For example, consider the sentence “the cloth was soft and silky”. After POS tagging, the sentence becomes: [The -> det] [cloth -> noun] [was -> verb] [soft -> adj] [and -> conj] [silky -> adj]. Here “soft and silky” is an adjective phrase, which the regular expression generates as AP = adj.conj.adj. The DFA for AP is shown in Figure 4.

Figure 4: DFA for Adjective Phrase

A verb phrase (VP) is composed of one verb as its head together with other words as its dependents. A verb phrase typically consists of an auxiliary (helping) verb and a main verb, and the helping verb always precedes the main verb [18]. In our work, a verb may be modified by adding an adverb, reflexive pronoun, objective pronoun, or negation. The pronouns used in the VP give a collective meaning with the verb as the root word. Auxiliary (helping) verbs are treated and tagged as simple verbs in this work. The POS used in the RE form the alphabet Σ = {adv, verb, refl_pron, obj_pron, neg}. The proposed RE for VP is: RE = (adv)*(verb)(refl_pron+obj_pron+neg)*(verb+adv)*

For example, in the sentences above, “are pleased”, “play”, “is”, “wears” and “has” are the verb phrases. The DFA used to validate the proposed regular expression is shown in Figure 5.

Figure 5: DFA for verb phrase

A prepositional verb phrase (PVP) is an idiomatic expression that combines a verb and a preposition to make a new verb with a distinct meaning [19]. In this work, a PVP is identified as a preposition followed by a verb phrase. Most of the time, the preposition is the word “to”. The alphabet for PVP is: Σ = {To, adv, verb, refl_pron, obj_pron, neg, noun}. The proposed RE/pattern for PVP is: RE = (To)(VP) + (To)(verb)(noun)

For example, in the sentences above, “to see him” is a prepositional verb phrase: to[pp] followed by see him[VP], so here PVP = pp.VP

The DFA used to verify the validation of the proposed regular expression is shown in Figure 6.

Figure 6: DFA for prepositional verb phrase

Predefined phrases (PDP) are groups of words which cannot be generated by the patterns, because they are fixed expressions already defined in the English language, for example “as far as”, “by the way” and “as well as”. Our system collects these phrases before any POS operation is performed and before any other phrases are identified in the sentence.
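Collecting predefined phrases first can be sketched as a longest-match scan over the words of the sentence. The list `PREDEFINED`, the function `extract_predefined`, and the use of `None` for words that still need a POS tag are assumptions made for illustration; the paper does not give its own procedure.

```python
# Hypothetical list of fixed expressions, using two of the examples in the text.
PREDEFINED = ["by the way", "as well as"]

def extract_predefined(words):
    """Merge each predefined phrase into one token tagged 'PDP'.

    All other words are returned with tag None, to be tagged later.
    """
    # Try longer phrases first so that the longest match wins.
    phrases = sorted((p.split() for p in PREDEFINED), key=len, reverse=True)
    out, i = [], 0
    while i < len(words):
        for phrase in phrases:
            window = [w.lower() for w in words[i:i + len(phrase)]]
            if window == phrase:
                out.append((" ".join(phrase), "PDP"))
                i += len(phrase)
                break
        else:
            out.append((words[i], None))
            i += 1
    return out

print(extract_predefined("By the way he plays cricket".split()))
```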

5. Chunking of English sentences and their PL representation

Predicate logic (PL) is a deductive symbolic system which allows us to determine whether propositions are true or false by assigning values to variables [20]. Depending on the approach of the algorithm, the sentences are split into parts of speech or phrases for representation in predicate logic. Predicate logic uses constants, variables, predicates and logical connectives to represent knowledge [21]. In our work, we used predicate logic to represent sentences. The VP is used as the predicate (function), while all the other phrases are used as its arguments. During retrieval, the first argument is retrieved, then the VP, and then the rest of the arguments. The argument in square brackets [] is retrieved as one argument in this research work.

For example: A juggler is a common figure in Pakistan. The phrases in the sentence are NP0 = a juggler, VP0 = is, NP1 = a common figure, PNP0 = in Pakistan. The logical structure of the sentence is VP0([NP0], NP1, PNP0) and its representation is: is([a juggler], a common figure, in Pakistan).

In English, a sentence is represented in predicate logic in such a way that the verb phrase is the function while the other phrases are the arguments of the function. During retrieval, the first argument of the function is retrieved, then the verb phrase, and then the remaining arguments from left to right. If the first argument is empty, [], then the VP is retrieved first. For example, let us take five sentences, make their chunks/phrases and then represent them in predicate logic.

  • We are pleased to see him.
  • A cool boy play cricket.
  • A juggler is a common person.
  • She wears a shabby dress.
  • He has a flute with him.

The proposed chunking of the above sentences, according to our corpus in Table 1, is given as follows.

  1. [PnP0 we PnP0] [VP0 are pleased VP0] [PVP0 to see him PVP0]
  2. [NP0 a cool boy NP0] [VP0 play VP0] [NP1 cricket NP1]
  3. [NP0 a juggler NP0] [VP0 is VP0] [NP1 a common person NP1]
  4. [PnP0 she PnP0] [VP0 wears VP0] [NP0 a shabby dress NP0]
  5. [PnP0 he PnP0] [VP0 has VP0] [NP0 a flute NP0] [PNP0 with him PNP0]
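The chunking above can be reproduced with a small scanner that tries the phrase patterns one after another on the tag sequence. This is a sketch under several assumptions that are not stated in the paper: the order in which the patterns are tried, the encoding of tags as space-terminated names, the rejection of empty matches, and the names `PATTERNS` and `chunk`. Predefined phrases (PDP) are not handled here.

```python
import re
from collections import Counter

# Tag-level patterns; every tag is followed by one space in the encoded string.
NP = r"(?:(?:conj|det|poss_pron|neg|num) )*(?:adj )*(?:noun )*"
VP = r"(?:adv )*verb (?:(?:refl_pron|obj_pron|neg) )*(?:(?:verb|adv) )*"
PATTERNS = [  # tried in this order (the priority is an assumption)
    ("PVP", re.compile(r"pp " + VP)),
    ("PNP", re.compile(r"pp (?:obj_pron |" + NP + r")")),
    ("VP", re.compile(VP)),
    ("PnP", re.compile(r"(?:conj )*subj_pron ")),
    ("AP", re.compile(r"adj (?:(?:conj|adj) )*")),
    ("NP", re.compile(NP)),
]

def chunk(tagged):
    """Group (word, tag) pairs into labelled phrases such as ('NP0', 'a flute')."""
    encoded = "".join(tag + " " for _, tag in tagged)
    # Map the character offset where each token starts to its token index.
    starts, pos = {}, 0
    for i, (_, tag) in enumerate(tagged):
        starts[pos] = i
        pos += len(tag) + 1
    starts[pos] = len(tagged)

    counts, chunks, pos = Counter(), [], 0
    while pos < len(encoded):
        for label, pattern in PATTERNS:
            m = pattern.match(encoded, pos)
            if m and m.end() > pos:  # ignore empty matches
                words = [w for w, _ in tagged[starts[pos]:starts[m.end()]]]
                chunks.append((f"{label}{counts[label]}", " ".join(words)))
                counts[label] += 1
                pos = m.end()
                break
        else:
            raise ValueError(f"no phrase pattern matches at token {starts[pos]}")
    return chunks

sentence5 = [("he", "subj_pron"), ("has", "verb"), ("a", "det"),
             ("flute", "noun"), ("with", "pp"), ("him", "obj_pron")]
print(chunk(sentence5))
```

Trying PVP and PNP before the plain VP and NP patterns keeps a preposition attached to the phrase that follows it; a different priority order would produce different chunks for some inputs.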

The five sentences above have the following PL structures.

  1. VP0 ([PnP0], PVP0)
  2. VP0 ([NP0], NP1)
  3. VP0 ([NP0], NP1)
  4. VP0 ([PnP0], NP0)
  5. VP0 ([PnP0], NP0, PNP0)

The PL representation of the above sentences is given as follows.

  1. are pleased([we], to see him)
  2. play([a cool boy], cricket)
  3. is([a juggler], a common person)
  4. wears([she], a shabby dress)
  5. has([he], a flute, with him)
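Building this representation from labelled chunks can be sketched as follows. The functions `to_predicate` and `render` are hypothetical names, and joining several chunks that precede the VP into one bracketed first argument is an assumption of this sketch.

```python
def to_predicate(chunks):
    """Return (predicate, args): the first VP is the predicate, and the
    chunks before it form the first argument (empty if the VP comes first)."""
    vp_index = next(i for i, (label, _) in enumerate(chunks)
                    if label.startswith("VP"))
    predicate = chunks[vp_index][1]
    first = " ".join(text for _, text in chunks[:vp_index])
    rest = [text for _, text in chunks[vp_index + 1:]]
    return predicate, [first] + rest

def render(predicate, args):
    """Write the predicate with its first argument in square brackets."""
    return f"{predicate}({', '.join([f'[{args[0]}]'] + args[1:])})"

chunks = [("PnP0", "he"), ("VP0", "has"), ("NP0", "a flute"), ("PNP0", "with him")]
print(render(*to_predicate(chunks)))
```

When the sentence starts with the verb phrase, the first argument is the empty string and is rendered as empty brackets.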

6. From PL of English to PL of Urdu

Before presenting the predicate logic representation for Urdu, the following steps are needed.

  1. If there is a noun phrase NP just before a prepositional noun phrase PNP in the PL of English, swap them with each other to obtain the Urdu PL representation.

For example, if the English PL has the phrases in the order VP (NP, PNP), then in the Urdu representation the phrases PNP and NP are interchanged: VP (PNP, NP)

Taking sentence 5 of the English sentences above, whose predicate logic structure is VP0 ([PnP0], NP0, PNP0), the Urdu structure is VP0 ([PnP0], PNP0, NP0). The PL representation from English to Urdu is shown in Figure 7.

  2. If there is a PVP before an NP in the English PL representation, swap them with each other, as in step 1, to obtain the PL representation for Urdu.

Figure 7: Swapping English phrases for Urdu PL representation
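The two swapping rules can be sketched as a single left-to-right pass over the argument labels. Treating "before" as "immediately before", applying the pass to the whole argument list, and skipping past a pair once it has been swapped are assumptions of this sketch; `SWAP_PAIRS`, `phrase_type` and `swap_for_urdu` are hypothetical names.

```python
# Adjacent (left, right) phrase types that are interchanged for Urdu.
SWAP_PAIRS = {("NP", "PNP"), ("PVP", "NP")}

def phrase_type(label):
    """Strip the trailing occurrence number, e.g. 'PNP0' -> 'PNP'."""
    return label.rstrip("0123456789")

def swap_for_urdu(labels):
    """Return the argument labels reordered for the Urdu PL representation."""
    out, i = list(labels), 0
    while i < len(out) - 1:
        if (phrase_type(out[i]), phrase_type(out[i + 1])) in SWAP_PAIRS:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # do not re-examine the pair that was just swapped
        else:
            i += 1
    return out

print(swap_for_urdu(["PnP0", "NP0", "PNP0"]))
```

For sentence 5 this turns the argument order PnP0, NP0, PNP0 into PnP0, PNP0, NP0, while structures with no matching pair are returned unchanged.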

Mapping is the translation of the root words of one language into all their possible translations in another language [21]. We did our mapping by taking English phrases as the root words and mapping them to their corresponding Urdu translations. The chunks or phrases mapped from English to Urdu in our work are shown in Table 3.

Table 3: English chunks mapping to respective Urdu chunks

| English chunks | Urdu chunks | English chunks | Urdu chunks | English chunks | Urdu chunks |
|---|---|---|---|---|---|
| is | ہے | and she | اور وہ | contains | شامل ہیں |
| it | یہ | and he | اور وہ | the articles | مضامین |
| are | ہو | in Pakistan | پاکستان میں | of his trade | اسکے کاروبار کے |
| you | تُم | cool boy | اچھا لڑکا | has | رکھتا ہے |
| this | یہ | cool girl | اچھی لڑکی | a flute | ایک بانسری |
| this boy | یہ لڑکا | nice boy | اچھا لڑکا | and small drum | اور چھوٹی ڈھول |
| this girl | یہ لڑکی | nice girl | اچھی لڑکی | with him | اپنے ساتھ |
| is not | نہیں ہے | the cool boy | اچھا لڑکا | before showing | ظاہر کرنے سے پہلے |
| a juggler | ایک جادوگر | the cool girl | اچھی لڑکی | his tricks | اپنی چالوں کو |
| and a juggler | اور ایک جادوگر | the nice boy | اچھا لڑکا | plays | کھیلتا ہے |
| juggler | جادوگر | the nice girl | اچھی لڑکی | on his flute | اپنی بانسری |
| a common figure | ایک عام شخص | a cool boy | ایک اچھا لڑکا | and drum | اور ڈھول |
| a common person | ایک عام شخص | a cool girl | ایک اچھی لڑکی | to attract people | ناظرین کی توجہ دلانے کیلئے |
| common figure | عام شخص | a nice boy | ایک اچھا لڑکا | goes | جاتا ہے |
| in every town | ہر شہر میں | a nice girl | ایک اچھی لڑکی | ground | گراونڈ |
| of pakistan | پاکستان کے | a juggler boy | ایک جادوگر لڑکا | to | کو |
| boy | لڑکا | ? | ؟ | is going | جارہا ہے |
| a boy | ایک لڑکا | play | کھیلتا ہے | we | ہم |
| a girl | ایک لڑکی | cricket | کرکٹ | all | سب |
| the boy | لڑکا | does | کیا | are pleased | خوش ہیں |
| the girl | لڑکی | to | کو | to see him | اُسکو دیکھنے کیلئے |
| he | وہ | to play | کھیلنے کیلئے | because he | کیونکہ وہ |
| she | وہ | came | آیا | amuses us | ہم کو آمادہ کرتا ہے |
| girl | لڑکی | a shabby dress | ایک پٹھی پُرانی لباس | with his wonderful feats | اپنی کرتبوں سے |
| wears | پہنتی ہے | carries | رکھتا ہے | to him | اس کو |
| dress | لباس | a bag | ایک بیگ | by the way | بہرحال |
| and | اور | which | جسمیں | | |
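In code, such a phrase table is naturally a dictionary from English chunks to Urdu chunks. The sketch below copies a few entries from Table 3; the names `EN_TO_UR` and `map_phrase`, the lowercasing of keys, and returning `None` for phrases outside the table are assumptions for illustration.

```python
# A few entries copied from Table 3, with English keys in lower case.
EN_TO_UR = {
    "a juggler": "ایک جادوگر",
    "is": "ہے",
    "a common person": "ایک عام شخص",
    "in pakistan": "پاکستان میں",
}

def map_phrase(phrase):
    """Return the Urdu chunk for an English chunk, or None if it is not in the table."""
    return EN_TO_UR.get(phrase.strip().lower())

print(map_phrase("A juggler"))
print(map_phrase("a magician"))  # not in the table
```

Because whole phrases are the lookup keys, every variant such as "boy", "a boy" and "the boy" needs its own entry, and any phrase that is not listed cannot be translated.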

Like English, the Urdu language is represented in predicate logic and is retrieved from left to right. The main difference from English is that, for Urdu, the VP is retrieved at the end of the retrieval process. All the remaining arguments of the predicate are retrieved from left to right, as shown in Figure 7. If the first argument is empty, [], then the second argument is retrieved first. Taking the five English sentences above as examples, the results of the retrieval are shown below.

The English chunks presented above are mapped to their respective Urdu translations in the bilingual corpus, as stated below for the five English sentences.

  1. [PnP0 ہم PnP0] [VP0 خوش ہیں VP0] [PVP0 اُسکو دیکھنے کیلئے PVP0]
  2. [NP0 ایک اچھا لڑکا NP0] [VP0 کھیلتا ہے VP0] [NP1 کرکٹ NP1]
  3. [NP0 ایک جادوگر NP0] [VP0 ہے VP0] [NP1 ایک عام شخص NP1]
  4. [PnP0 وہ PnP0] [VP0 پہنتی ہے VP0] [NP0 ایک پٹھی پُرانی لباس NP0]
  5. [PnP0 وہ PnP0] [VP0 رکھتا ہے VP0] [NP0 ایک بانسری NP0] [PNP0 اپنے ساتھ PNP0]

After the required swapping of phrases in the English predicate logic the structural representation is shown as under:

  1. VP0 ([PnP0], PVP0)
  2. VP0 ([NP0], NP1)
  3. VP0 ([NP0], NP1)
  4. VP0 ([PnP0], NP0)
  5. VP0 ([PnP0], PNP0, NP0)

The predicate logic representation of Urdu phrases is shown as under:

  1. خوش ہیں ([ہم], اُسکو دیکھنے کیلئے)
  2. کھیلتا ہے ([ایک اچھا لڑکا], کرکٹ)
  3. ہے ([ایک جادوگر], ایک عام شخص)
  4. پہنتی ہے ([وہ], ایک پٹھی پُرانی لباس)
  5. رکھتا ہے ([وہ], اپنے ساتھ, ایک بانسری) 

Using our retrieval logic for Urdu as presented above, the resulting sentences are as follows.

  1. ہم اُسکو دیکھنے کیلئے خوش ہیں
  2. ایک اچھا لڑکا کرکٹ کھیلتا ہے
  3. ایک جادوگر ایک عام شخص ہے
  4. وہ ایک پٹھی پُرانی لباس پہنتی ہے
  5. وہ اپنے ساتھ ایک بانسری رکھتا ہے
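The two retrieval orders can be sketched side by side: for English the predicate follows the first argument, while for Urdu it comes after all arguments. The function names `retrieve_english` and `retrieve_urdu` are hypothetical, and an empty first argument is represented here by an empty string.

```python
def retrieve_english(predicate, args):
    """First argument, then the verb phrase, then the remaining arguments."""
    first, rest = args[0], args[1:]
    parts = ([first] if first else []) + [predicate] + rest
    return " ".join(parts)

def retrieve_urdu(predicate, args):
    """All arguments from left to right, with the verb phrase last."""
    return " ".join([a for a in args if a] + [predicate])

print(retrieve_english("has", ["he", "a flute", "with him"]))
print(retrieve_urdu("رکھتا ہے", ["وہ", "اپنے ساتھ", "ایک بانسری"]))
```

The Urdu call uses the arguments in their swapped order, so the result corresponds to sentence 5 in the list above.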

7. Proposed Algorithms

7.1. Algorithm for English PL representation

The proposed algorithm for English text has the following steps.

  1. Tokenize the input text and tag every token with its respective POS.
  2. Generate phrases from the tagged tokens.
  3. Make the first VP of the sentence the predicate and place all the other phrases as the parameters of the predicate.
  4. If there is more than one VP in the text,
  5. Use another predicate inside the first one.
  6. Retrieve the second predicate like a simple predicate after the retrieval of the first predicate.
  7. If VP comes at the start of the sentence then place empty brackets as the first parameter of the predicate.
  8. On having more than a single word in the parameter, use a space between the words.

7.2. Algorithm for swapping concept

  1. Use the predicate logic for English and analyze the representation of the parameters.
  2. If an NP comes just before a PNP, or a PVP comes before an NP,
  3. Swap them with each other during the logical representation for Urdu.

7.3. Algorithm for Urdu logic representation

  1. Represent the predicate logic for Urdu after performing the necessary swapping of phrases in English PL representation.
  2. Place VP as the predicate and all the other arguments as the parameters from left to right.
  3. If the arguments contain more than one VP,
  4. Make a predicate in the parameters in a left to right manner.
  5. Retrieve it after the retrieval of first predicate.
  6. If VP comes at the start of the sentence then place empty brackets as the first parameter of the predicate.

7.4. Algorithm for mapping

  1. Grab the data from the local files which are already defined.
  2. Create the mapping rules for the sentences.
  3. Map the English phrases to their corresponding Urdu phrases.
  4. Retrieve the text for Urdu and present it in the Predicate logic.
  5. Extract the mapped sentences.

7.5. Algorithm for Urdu language retrieval

  1. Grab the data from the file.
  2. Define the phrases in the given sentence.
  3. Apply the predicate logic for each sentence.
  4. Retrieve the sentences in left to right order, retrieving the VP last, after all the parameters.

The Diagrammatic representation of the algorithms explained in this section is shown in Figure 8.

Figure 8: Diagrammatic representation for proposed Algorithm

8. Conclusion and future work

Drawing upon the work done in this thesis, a tagged text and a bilingual corpus of phrases, given in Table 1 and Table 3 respectively, have been used. After tokenizing and tagging the sentences, phrases are generated automatically using our proposed patterns/regular expressions. Each sentence is then represented in English predicate logic. This work has also introduced phrase swapping before the PL representation for Urdu. Finally, after mapping the English phrases to Urdu, the PL for Urdu is generated. This work proposes that Urdu text be retrieved from PL by retrieving the phrases from left to right, with the verb phrase last. The algorithm is implemented using web technologies as a working web application which shows the results.

Because the system is a rule-based machine translation (RBMT) system, it has a few drawbacks. The system is designed for a limited corpus, so it will not work on words that fall outside the given corpus. If the English PL retrieved for an input text is accurate, the system will generate a correct translation, but if the retrieved PL is not 100% accurate, the translation will be ambiguous. Using morphological analysis of words, instead of mapping more phrases, would save space and time.

9.           References

[1]F. Zanettin, “Bilingual Comparable Corpora and the Training of Translators,” vol. 43(4), pp. 616-630, December 1998.
[2]B. J. Dorr, L. S. Levin and E. H. Hovy, “Machine translation: interlingual methods,” in Brown K (eds) Encyclopedia of language and linguistics, Elsevier, Oxford, UK, 2004.
[3]K. Imamura, H. Okuma, T. Watanabe and E. Sumita, “Example-based Machine Translation Based on Syntactic Transfer with Statistical Models,” in Proceedings of COLING, Geneva, Switzerland, pp.99-105, 2004.
[4]A. Lavie, K. Probst, E. Peterson, S. Vogel, L. Levin, A. Font-Llitjos and J. Carbonell, “A Trainable Transfer-based Machine Translation Approach for Languages with Limited Resources,” in Proceedings of Workshop of the European Association for Machine Translation (EAMT-2004), Valletta, Malta, 2004.
[5]O.-W. Kwon, S.-K. Choi, K.-Y. Lee, Y.-H. Roh and Y.-G. Kim, “English-Korean Patent Translation System: FromTo-EK/PAT,” in MT Summit XI Workshop on Patent Translation Program, Copenhagen, Denmark, September 11, 2007.
[6]Omniscien Technologies, [Online]. Available: https://omniscien.com/?faqs=what-is-statistical-machine-translation-smt. [Accessed 10 September 2018].
[7]D. Turcato and F. Popowich, “What is Example-Based Machine Translation?,” in Carl and Way (2003), 2003, pp. 59-81.
[8]J. Ma, “Automata in Natural Language Processing,” Technical Report 0834, Laboratoire de Recherche et Developpement, l’Epita, France, December 2008.
[9]E. Roche and Y. Schabes, “Introduction to Finite-State Devices in Natural Language Processing,” 1 January 1996.
[10]R. L. Schmidt, “Postpositions,” in Urdu, an Essential Grammar, London and New York, Routledge 11, New Fetter Lane London, 1999, pp. 68-85.
[11]Center For Language Engineering 2013, Urdu Parts of Speech Tagset, LAHORE: Center for Language Engineering, KICS, UET, 2013.
[12]S. Saxena, R. Raperya and N. K. Malik, “MACHINE LEARNING USING CHUNKING,” International Journal of Advance Research in Science and Engineering, vol. 6, no. 02, pp. 285-292, February 2017.
[13]M. Hamada and S. Sato, “A Game-based Learning System for Theory of Computation Using Lego NXT Robot,” in International Conference on Computational Science, ICCS 2011, Procedia Computer Science 4 (2011), 2011, pp. 1944-1952.
[14]D. Ather, R. Singh and V. Katiyar, “Simplifying Designing Techniques: To Design DFA that Accept Strings over ∑= {a, b} having at least x Number of a and y Number of b,” International Journal of Computer Applications (0975 – 8887), vol. 91, no. 07, pp. 12-17, April 2014.
[15]S. Wason, S. Rathi and P. Kumar, “RESEARCH PAPER ON AUTOMATA,” 2014 IJIRT, ISSN: 2349-6002, vol. 1, no. 5, pp. 507-510, 2014.
[16]W. M. Soon, H. T. Ng and D. C. Y. Lim, “A Machine Learning Approach to Coreference Resolution of Noun Phrases,” Association for Computational Linguistics, vol. 27, no. 4, pp. 521-543, 2001.
[17]C. D. Manning and H. Schütze, “Foundations of Statistical Natural Language Processing,” Cambridge, MA: The MIT Press, 1999, vol. 26, no. 2, 1999.
[18]C. L. Vitto, “Verb Phrase,” in Grammar by Diagram Second Edition, Broadview Press, November 2008, pp. 26-30.
[19]H. Broekhuis, H. Broekhuis and R. Vos, “PP-complements (prepositional objects),” in Syntax of Dutch Verbs and Phrases Volume 1, Amsterdam, Amsterdam University Press, 2015, pp. 284-321.
[20]K. H. Rosen, “Predicates and Quantifiers,” in Discrete Mathematics and its Applications Seventh Edition, 1221 Avenue of the Americas, New York, McGraw-Hill, a business unit of The McGraw-Hill Companies, Inc., 2012, pp. 37-40.
[21]P. Bakliwal, D. V V and C. V. Jawahar, “Align Me : A framework to generate Parallel Corpus Using OCRs & Bilingual Dictionaries,” in Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing, pages 183–187, Osaka, Japan, December 2016.
[22]L. Sweeney, “That’s AI?: A History and Critique of the Field,” School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3890, pp. 19-23, July 2003.
[23]R. Weischedel, J. Carbonell, B. Grosz, W. Lehnert, M. Marcus, R. Perrault and R. Wilensky, “White Paper on Natural Language Processing,” in Proc. DARPA Speech and Natural Language Workshop, Harwich Port, Massachusetts, October 1989.
[24]R. Davis, H. Shrobe and P. Szolovits, “What is Knowledge Representation?,” AI Magazine, pp. 8-15, 1993.
[25]N. Mandelblit, “Machine Translation: A Cognitive Linguistics Approach,” in Proceedings of the 5th Int. Conf. on Theoretical and Methodological Issues in Machine, Kyoto, Japan, 1993.
[26]N. Chomsky, Knowledge of Language: Its Nature, Origin, and Use, USA: Praeger Publishers, 521 Fifth Avenue, New York, NY 10175, 1986.
[27]B. Vauquois, “A Survey of Formal Grammars and Algorithms for Recognition and Transformation in Machine Translation,” in Proceedings of the IFIP Congress-6, Edinburgh, 1968, pp. 254-260.
[28]S. Tripathi and J. K. Sarkhel, “Approaches to machine translation,” Annals of Library and Information Studies, vol. 57, pp. 388-393, December 2006.
[29]G. Jirásková and M. Palmovský, “Kleene Closure and State Complexity,” ITAT 2013 Proceedings, CEUR Workshop Proceedings, vol. 1003, pp. 94-100, 2013.
[30]P. Schachter and T. Shopen, “Parts of speech systems,” in Language Typology and Syntactic Description: Clause Structure, vol. 1, Cambridge University Press, October 2007.
[31]M. Stavrou and A. Terzi, “Types of Numerical Nouns,” in Proceedings of the 26th West Coast conference on formal linguistics. eds. Charles B. Chang and Hannah J. Haynie, Somerville, MA: Cascadilla, 2008.
[32]G. Kleiser, “What is a verb?,” in Exploring English Grammar, New Delhi, APH Publishing Corporation, 2008, p. 2.
[33]J. Walter, “Building Writing Skills the Hands-on Way,” in Building Writing Skills the Hands-on Way, Boston, MA 02210 USA, Cengage Learning, 2016, pp. 165-171.
[34]D. Veselka, English Articles and Determiners, Independently published, 2017.
[35]J. J. Webster and C. Kit, “Tokenization as the Initial Phase in NLP,” in Proc. of COLING-92, Nantes, Aug. 23-28, 1992, pp. 1106-1110.
[36]A. Amjad and M. A. Khan, “Selecting Predicate Logic for Knowledge Representation by Comparative Study of Knowledge Representation Schemes,” in International Conference on Emerging Technologies, IEEE, pp.23-28, 2009.
