Framing an inquiry: induction, deduction, hypotheses

 

Hermann Moisl


 

Introduction

 

Linguistics is a science, and must therefore be pursued on scientific principles. Last week's lecture considered the nature of scientific inquiry with special reference to Karl Popper's proposals on scientific method, on which current mainstream science is based. The present lecture continues that discussion. It will initially cover some of the same ground as last week's --though possibly from a different angle-- with the aim of setting the scene, but will then move on to some new ideas.

 

The discussion is in three main parts: the first part looks at varieties of inference, the second at the role of these varieties of inference in scientific method, and the third at the application of scientific method to linguistic research.


 

 1. Ways of knowing: deductive, inductive and abductive inference

 

Informally, an inference is a conclusion drawn from facts that one knows:

 

  • Your umbrella is wet, therefore it's raining.
  • It's three o'clock and the appointment was for two, therefore he's not coming.
  • I think, therefore I am.

 

The whole point of science is to make valid inferences about the natural world. Note that the inferences must be valid; what constitutes a valid inference will be addressed in what follows.

 

There are three types of inference: deductive, inductive, and abductive. We'll look at these in turn, and then relate them to scientific method in the subsequent section. By way of example we will use an urn full of red marbles (from the Stanford Encyclopedia of Philosophy: http://plato.stanford.edu/entries/peirce/#dia).

  • Deductive inference

A deductive inference is one that follows necessarily from given premises or, less formally, from a given fact or facts: if the given fact or facts are true, an inference from those facts using the rules of logic must also be true. Given our example urn:

-- Premise: All marbles in the urn are red

-- Observation: This marble is from the urn

-- Inference: This marble is red

Given that all marbles in the urn are red, and that I have a marble taken from the urn, it is necessarily true that that marble will be red; if the marble is not red, that is, if the inference is untrue, then either the premise or the observation or both must be untrue. The form of the argument, however, is not in question --it is an absolute rule of logic, and, where it is used, it will always derive true inferences from true premises and observations. A deductive inference from true premises and observations using the rules of logic is always valid with respect to the world.

 

Various rules of logic are used in deductive inference. The one used above is called modus ponens; others can be looked up in the relevant section of the readings at the end of this discussion.

 

  • Inductive inference

In inductive inference there are no premises. Instead, an inductive inference is a generalization based entirely on observation of the world: given some number of observations an inference is drawn from them. Referring again to the urn, if Gordon Brown gives me a sequence of marbles which he says are from the urn, and if all the marbles are red, my inference is that all the marbles in the urn are red. Clearly, this inference is not necessarily true. It may be that there are other colours in there as well, and that it just so happened that all the marbles I saw were red. In other words, inductive inferences are not necessarily true in the way that deductive ones are, and are therefore not guaranteed to be valid with respect to the world.

 

There are no rules of inductive inference in the way that there are for deductive inference. Instead, we have statistics. Statistics is the discipline that uses sample observations of the world to make inferences about the state of the world, and to assess the probability that such inferences are true.
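As a rough illustration of the kind of calculation statistics makes possible, the following Python sketch asks how likely it is that every marble handed over is red when the urn in fact contains other colours as well. The 90% figure, the sample sizes, and the assumption that marbles are drawn independently and with replacement are invented for the example; they are not part of the urn discussion above.

```python
def prob_all_red(red_fraction, n_draws):
    """Probability that n_draws independent draws, with replacement,
    all come up red when red_fraction of the urn's marbles are red."""
    if not 0.0 <= red_fraction <= 1.0:
        raise ValueError("red_fraction must lie between 0 and 1")
    return red_fraction ** n_draws


# Assumed scenario: only 90% of the marbles in the urn are red.
for n in (5, 10, 50):
    print(n, prob_all_red(0.9, n))
```

Under these assumptions an all-red run of 5 marbles is quite likely even though the generalization 'all marbles in the urn are red' is false, whereas an all-red run of 50 is very unlikely. Larger samples make the inductive inference more probable, but never certain.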

 

  • Abductive inference

This type of inference is less well known than the preceding ones, and takes a little getting used to. Like deductive inference it starts with premises and makes observations, but the inferences do not result from application of the rules of logic. Let's go back to our urn. An abductive inference would go like this:

-- Premise: All marbles in the urn are red

-- Observation: I have a number of red marbles

-- Inference: The marbles I have came from the urn

Clearly, the inference does not follow from the premise and the observation, that is, the inference is not necessarily true and therefore not necessarily valid. The inference is reasonable given the premise and the observation, but others are possible --for example, that the marbles I have came from another urn with red marbles in it, or from my pocket.
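The difference between the deductive and the abductive argument forms can be checked mechanically. The Python sketch below is a simplification that treats 'this marble is from the urn' and 'this marble is red' as two true-or-false statements about a single marble, and reads the premise as 'if the marble is from the urn, then it is red'. The function names are hypothetical and chosen only for this illustration. An argument form counts as valid if no combination of truth values makes all the premises true and the inference false.

```python
from itertools import product


def implies(p, q):
    """Material implication: 'if p then q'."""
    return (not p) or q


def is_valid(premises, conclusion):
    """True if the conclusion holds in every case where all premises hold."""
    for from_urn, is_red in product([True, False], repeat=2):
        if all(p(from_urn, is_red) for p in premises):
            if not conclusion(from_urn, is_red):
                return False  # counterexample: premises true, conclusion false
    return True


def rule(from_urn, is_red):
    """Premise: if the marble is from the urn, it is red."""
    return implies(from_urn, is_red)


# Deductive form: rule + 'from the urn'  =>  'red'
deductive = is_valid([rule, lambda u, r: u], lambda u, r: r)

# Abductive form: rule + 'red'  =>  'from the urn'
abductive = is_valid([rule, lambda u, r: r], lambda u, r: u)

print(deductive, abductive)  # True False
```

The abductive form fails because of the case in which the marble is red but did not come from the urn, which is exactly the 'another urn, or my pocket' possibility noted above.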

 


 

2. Scientific method

'Science is a method of discovering reliable knowledge about nature' (Schafersman, http://www.freeinquiry.com/intro-to-sci.html). To do this, it applies the three types of inference just discussed in a specific sequence:

  • Step 1: Given some domain of interest, the researcher defines a research question. In meteorology, for example, the researcher has noticed that, in geographical regions where the weather is generally stable, there is little variation in soil temperature, and that, in geographical regions where the weather is generally unstable (the UK!) there is considerable variation in soil temperature. S/he therefore poses a research question: 'What is the connection between soil temperature and weather stability?'.

  • Step 2: Data is gathered and examined, and inferences are drawn from it using inductive inference. The meteorologist takes a series of measurements of soil temperatures and such things as degree of cloud cover and amount of rainfall across a range of geographical areas and then draws inferences from these measurements like: soil temperature is proportional to intensity and duration of sunshine, soil temperature is inversely proportional to the amount and duration of cloud cover, and so on.

  •  Step 3: The researcher draws an abductive inference. The premise is that there is an observed connection between weather stability and soil temperature. The observations are the inferences drawn at step 2. The abductive inference is, say: 'Variation in soil temperature causes unstable weather'. This is a reasonable inference to explain the premise in terms of the observations, but it is not the only possible one --that is, the abductive inference doesn't necessarily follow. Another plausible one would be: 'Unstable weather causes variation in soil temperature'. Yet another would be: 'Unstable weather and variation in soil temperature are both caused by a third factor not yet taken into account, such as variation in the motion of the jet stream in the North Atlantic'.

In science, such abductive inferences are called 'hypotheses', and we shall refer to them as hypotheses from now on.

  • Step 4: The hypothesis is tested in two ways:

i. Is additional data taken from any geographical region in the world compatible with the hypothesis? If the additional measurements allow the same inductive inferences as the original data, then it is compatible with the hypothesis, and the hypothesis is supported. If not, new inductive inferences have to be made, and the hypothesis may have to be emended or abandoned.

ii. Are deductive inferences drawn from the hypothesis compatible with observed reality? Such deductive inferences are necessarily true if the premise (the hypothesis) is true and the observations (the inductive inferences) are valid; if observation of reality shows a deductive inference to be false, and if the observations are valid, then the hypothesis must be false. The meteorologist's hypothesis in the present case not only claims that variation in soil temperature causes unstable weather in the geographical regions from which the original data was taken, but also, by deductive inference, that in any region in which there is variation in soil temperature, the weather will be unstable. S/he surveys additional regions: if, in each case, there is unstable weather when the soil temperature is variable and stable weather when it is not, the hypothesis is supported, but if even one case is found where there is soil temperature variation and stable weather, or vice versa, the hypothesis will need to be emended or abandoned. This is Popperian falsification.
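The falsification test in (ii) amounts to a search for counterexamples. A minimal Python sketch follows; the region names and survey values are invented purely for illustration and are not real meteorological data. Each record states whether soil temperature was variable and whether the weather was unstable in a surveyed region, and a region counts against the hypothesis when the two disagree in either direction, as described above.

```python
# Hypothetical survey records, invented for this example:
# (region, soil_temperature_variable, weather_unstable)
surveys = [
    ("region_a", True, True),
    ("region_b", False, False),
    ("region_c", True, False),
]


def counterexamples(records):
    """Return the regions where the hypothesis's prediction fails:
    variable soil temperature with stable weather, or vice versa."""
    return [region for region, variable, unstable in records
            if variable != unstable]


found = counterexamples(surveys)
if found:
    print("Hypothesis needs to be emended or abandoned; counterexamples:", found)
else:
    print("Hypothesis supported by the surveyed regions (not proven).")
```

Note the asymmetry: an empty list of counterexamples only supports the hypothesis, whereas a single counterexample based on valid observations is enough to count against it.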

We shall see a detailed example of how this method is applied to linguistic research in the next section. 

 


 

3. Scientific method and linguistic research

 

In linguistic research one typically chooses one of the main subdivisions of the subject --generative linguistics, historical linguistics, sociolinguistics and so on-- and in the chosen subdivision one selects some particular area --syntax, phonology, etc. The method then applies as follows.

 

 

Step 1: Formulation of a research question

Formulation of a research question is a matter of reading in the selected subject area, discussing the subject area with colleagues or, perhaps, a supervisor, and eventually identifying a topic that is, on the one hand, interesting to the researcher and the linguistics research community, and on the other has not been satisfactorily investigated or indeed investigated at all. To exemplify this and subsequent steps in the application of scientific method we will select dialectology as our discipline subdivision, and within that subdivision the area of phonetics, though clearly the discussion from this point onwards applies to other linguistic subdivisions, subject areas, and topics more generally. Our research question will be a very simple one:

 

'Is there systematic phonetic variation in the Tyneside speech community?'.

 

Step 2: Data gathering and drawing of inductive inferences

To address the research question, a sample of informants drawn from the Tyneside speech community has to be selected and the phonetic characteristics of their speech observed. Observation of their speech characteristics generates the data on which the investigation is based, and inductive inferences about the phonetic characteristics of the Tyneside speech community will be based on that data.

 

In principle one should at this stage identify and interview informants, gathering a collection or 'corpus' of Tyneside speech as the source of the data. For convenience, though, we will use an existing collection: the Newcastle Electronic Corpus of Tyneside English, or NECTE for short. NECTE is a corpus of Tyneside dialect speech which includes phonetic transcriptions of 63 speaker interviews, and is therefore exactly what is needed to address our research question. We will use these phonetic transcriptions to generate our data.

 

What should that data look like? In other words, what kind of data would allow us to investigate the phonetic characteristics of Tyneside speakers, and more specifically to see if there are systematic phonetic variations among its speakers? The obvious answer is to create a profile of each speaker's phonetic usage, and then to compare the speakers with one another.

 

 What should such a phonetic profile look like? In the NECTE phonetic transcription scheme there are 158 symbols, that is, the audio signal from the recorded interviews is interpreted in terms of 158 different phonetic segments. Let us, therefore, represent each speaker with a vector containing 158 elements, in which each element represents a different phonetic segment:

 

In each vector element we put a numerical value which represents the number of times the speaker uses that element in his or her interview:

 

Thus, this speaker uses one segment 23 times, another 4 times, and so on. Since there are 63 speakers in the NECTE corpus, the data is a matrix with 63 rows each of which is a phonetic profile for a different speaker:
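As a minimal sketch of how such a matrix could be built, the Python code below counts how often each speaker uses each segment. The three-symbol inventory, the speaker names, and the tiny transcriptions are invented for illustration; the real NECTE scheme has 158 symbols and 63 speakers, and its actual file format is not assumed here.

```python
from collections import Counter

# Hypothetical segment inventory (NECTE has 158 symbols; three shown here).
SEGMENTS = ["ə", "ɔː", "ɪ"]

# Hypothetical transcriptions: each interview as a sequence of segment symbols.
transcriptions = {
    "speaker_1": ["ə", "ɪ", "ə", "ɔː"],
    "speaker_2": ["ɪ", "ɪ", "ɔː"],
}


def profile(segments_used, inventory):
    """Return one vector element per inventory symbol, holding the number
    of times that symbol occurs in the speaker's transcription."""
    counts = Counter(segments_used)
    return [counts[symbol] for symbol in inventory]


# One row per speaker, one column per segment.
speakers = sorted(transcriptions)
M = [profile(transcriptions[name], SEGMENTS) for name in speakers]

for name, row in zip(speakers, M):
    print(name, row)
```

With the full corpus the same procedure would yield 63 rows of 158 counts each.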

 

Having constructed this matrix, inductive inferences can be drawn from it by comparing the 63 profiles with one another. But there is a problem. The figure immediately above gives a very small fragment of the actual matrix in order to show it easily on the page. The complete matrix can be seen by following this link. The problem is easy enough to see. One can look at this matrix until the cows come home, but the inductive inferences don't come.

 

This highlights a general problem that has been with most of the physical and social sciences for a long time: information overload. Human cognition isn't good at seeing patterns in large collections of numbers. This is now becoming an issue in linguistics. The advent of electronic text in the final decades of the 20th century has generated large and, increasingly, very large collections of natural language text that are available for exploitation by linguists. The problem is that these ever-growing collections are typically too large for any individual researcher to read through and abstract data from in a reasonable time or indeed in a lifetime, and data abstracted from them is impenetrable, as we have just seen. One alternative is to ignore this development and to deal only with collections of tractable size --this much data and no more. This is not scientifically respectable. The other alternative is to use the large collections, and to use computational methods to support our inferences. That is the approach taken here.

 

Because information overload has been a problem for a long time in the other sciences, many computational methods to support the drawing of inferences from data have been developed, and we can't even begin to attempt an overview of them here. One method is therefore selected to exemplify how they can be used for inductive inference from data: cluster analysis.

 

To understand cluster analysis, vector space geometry first has to be understood. A vector is a sequence of numbers, as we saw above, but it is not just a sequence of numbers. Any vector has a geometrical interpretation. To see how, assume a vector consisting of two elements, say v = [30,70]. Under a geometrical interpretation, the two elements of v define a two-dimensional space, the numbers at v[1] = 30 and v[2] = 70 are coordinates in that space, and the vector v itself is a point at the coordinates [30,70].

A vector consisting of three elements, say v = [40,20,60] defines a three-dimensional space in which the coordinates of the point v are 40, 20, and 60:

 

A vector v = [22,38,52,12] defines a four-dimensional space with a point at the stated coordinates, and so on to any dimensionality n. Vector spaces of dimensionality greater than 3 are impossible to visualize directly and are therefore highly counterintuitive, but mathematically there is no problem with them; 2 and 3 dimensional spaces are useful as a metaphor for conceptualizing higher-dimensional ones.

 

When numerous vectors exist in a space, it may or may not be possible to see interesting structure in the way they are arranged in the space. The figure below shows vectors in two and three dimensional spaces. In (a) the vectors were randomly generated and there is no structure to be observed; in (b) there are two clearly defined concentrations in two-dimensional space; in (c) there are two clearly defined concentrations in three-dimensional space.

 

The existence of concentrations like those in (b) and (c) can indicate relationships among the entities that the vectors represent. In (b), for example, if the horizontal axis measures weight and the vertical one height for a sample human population, then members of the sample fall into two groups: tall, light people on the one hand, and short heavy ones on the other.

This idea of identifying clusters of vectors in vector space and interpreting them in terms of what the vectors represent is the basis of computational classification methods. In what follows, we shall be attempting to classify the NECTE speakers into groups on the basis of their phonetic usage by looking for clusters in the arrangement of the row vectors of the NECTE data matrix, referred to from here on as M, in 158-dimensional space.

Now, where the vectors are two or three-dimensional they can simply be plotted and any clusters will be visually identifiable, as we have just seen. But what about when the vector dimensionality is greater than 3 --say 4, or 10, or 100? In such a case direct plotting is not an option --how exactly would one draw a 6-dimensional space, for example? Many data matrix row vectors have dimensionalities greater than 3 --the NECTE matrix M has dimensionality 158-- and, to identify clusters in such high-dimensional spaces, some more general procedure than direct plotting is required. A variety of such general procedures is available, and they are generically known as cluster analysis methods. This section looks at these methods.

  • Distance in vector space

Where there are two or more vectors in a space, it is possible (i) to measure the distance between any two of them, and (ii) to rank them in terms of their proximity to one another: the following figure shows a simple case of a 2-dimensional space in which the distance from vector A to vector B is greater than the distance from A to C.

There are various ways of measuring such distances, but the most often used is the Euclidean distance, which rests on the Pythagorean theorem familiar from school: 'In a right-angled triangle, the square of the length of the hypotenuse is the sum of the squares of the lengths of the other two sides'. In the following figure distance(AB)^2 = (5 - 1)^2 + (4 - 2)^2 = 20, so distance(AB) is the square root of 20, or about 4.47.
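The same calculation generalizes to vectors of any dimensionality: square the difference in each dimension, sum the squares, and take the square root. A short Python sketch follows; the function name is hypothetical, and the assignment of the coordinates (1, 2) and (5, 4) to A and B respectively is an assumption read off the formula above.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two vectors of equal dimensionality."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of elements")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Assumed coordinates for the two points in the 2-dimensional example.
A = (1, 2)
B = (5, 4)
print(euclidean_distance(A, B))  # square root of 20, about 4.47

# The same function works unchanged in higher-dimensional spaces,
# here with two arbitrary 4-dimensional vectors.
print(euclidean_distance([22, 38, 52, 12], [20, 38, 50, 12]))
```

Because nothing in the function depends on the number of elements, it applies equally to the 158-element speaker profiles discussed in this section.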

 

  • Cluster analysis methods

Cluster analysis methods use relative distance among vectors in a space to group the vectors into clusters. Specifically, for a given set of vectors in a space, they first calculate the distances between all pairs of vectors, and then group into clusters all the vectors that are relatively close to one another in the space and relatively far from those in other clusters. 'Relatively close' and 'relatively far' are, of course, vague expressions, but they are precisely defined by the various clustering methods, and for present purposes we can avoid the technicalities and rely on intuitions about relative distance.

For concreteness, we will concentrate on one particular class of methods: hierarchical cluster analysis. Hierarchical cluster analysis represents the relativities of distance among vectors as a constituency tree. The following figure exemplifies this.

 

      v1   v2
 1    27   46
 2    29   48
 3    30   50
 4    32   51
 5    34   54
 6    55    9
 7    56    9
 8    60   10
 9    63   11
10    64   11
11    78   72
12    79   74
13    80   70
14    84   73
15    85   69
16    27   55
17    29   56
18    30   54
19    33   51
20    34   56
21    55   13
22    56   15
23    60   13
24    63   12
25    64   10
26    84   72
27    85   74
28    77   70
29    76   73
30    76   69

(a) Data matrix    (b) [Figure: vector plot (upper part) and hierarchical cluster tree (lower part)]

 

Column (a) shows a 30 x 2 data matrix that is to be cluster analyzed. Because the data space is 2-dimensional the vectors can be directly plotted to show the cluster structure; this is shown in the upper part of column (b). The corresponding hierarchical cluster tree is shown in the lower part of column (b). Tree diagrams like this are familiar to linguists as representations of sentence phrase structure, but differ from linguistic trees in the following respects:

  • The leaves are not lexical tokens but labels for the data items --the numbers at the leaves correspond to the numerical labels of the row vectors in the data matrix.

  • They represent not grammatical constituency but relativities of distance between clusters. The lengths of the branches linking the clusters represent degrees of closeness: the shorter the branch, the more similar the clusters. Starting near the top of the tree, vectors 4 and 19 are very close and thus linked with very short lines; 2 and 3 are almost but not quite as close as 4 and 19, and are therefore linked with slightly longer lines, and so on.

Knowing this, the tree can be interpreted as follows. There are three clusters labelled A, B, and C in each of which the distances among vectors are quite small. These three clusters are relatively far from one another, though A and B are closer to one another than either of them is to C. Comparison with the vector plot shows that the hierarchical analysis accurately represents the distance relations among the 30 vectors in 2-dimensional space.
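For readers who want to see how such a tree can be computed, the following Python sketch runs a simple bottom-up (agglomerative) hierarchical clustering on the 30 x 2 data matrix of column (a). It repeatedly merges the two closest clusters until three remain. The text above does not say which definition of between-cluster distance was used for the figure, so the choice of average linkage here (mean distance over all pairs of vectors, one from each cluster) is an assumption, as are the function names.

```python
import math

# The 30 x 2 data matrix from column (a); table row k is DATA[k - 1].
DATA = [
    (27, 46), (29, 48), (30, 50), (32, 51), (34, 54),
    (55, 9), (56, 9), (60, 10), (63, 11), (64, 11),
    (78, 72), (79, 74), (80, 70), (84, 73), (85, 69),
    (27, 55), (29, 56), (30, 54), (33, 51), (34, 56),
    (55, 13), (56, 15), (60, 13), (63, 12), (64, 10),
    (84, 72), (85, 74), (77, 70), (76, 73), (76, 69),
]


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(c1, c2, data):
    """Mean distance over all pairs with one vector from each cluster."""
    total = sum(euclidean(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(data, n_clusters):
    """Merge the two closest clusters until n_clusters remain.
    Returns the final clusters and the merges in the order they were made."""
    clusters = [[i] for i in range(len(data))]  # start: one cluster per vector
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], data)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, sorted(clusters[a]), sorted(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


clusters, merges = agglomerate(DATA, 3)
for members in clusters:
    # Report the 1-based row labels used in the table.
    print(sorted(i + 1 for i in members))
```

With these assumptions the first merge recorded joins vectors 4 and 19, which lie at distance 1 from one another (a few other pairs tie at that distance), and stopping at three clusters groups rows 1-5 with 16-20, rows 6-10 with 21-25, and rows 11-15 with 26-30. The successive merge distances in `merges` correspond to the branch lengths of a cluster tree.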

 

Given that the tree tells us nothing more than what the plot tells us, what is gained? In the present case, nothing. The real power of hierarchical analysis lies in its independence of vector space dimensionality. We have seen that direct plotting is limited to three or fewer dimensions, but there is no dimensionality limit on hierarchical analysis --it can determine relative distances in vector spaces of any dimensionality and represent those distance relativities as a tree analogous to the one above. To exemplify this, the 158-dimensional NECTE data matrix M was hierarchically cluster analyzed, and the results of the analysis are shown below.

 

  • Cluster analysis of the NECTE data matrix

Recall that the NECTE data is a 63 x 158 matrix M in which each of the 63 rows represents a speaker, each of the columns a phonetic segment, and the value at Mij is the number of times speaker i uses phonetic segment j. Each row vector is therefore a phonetic profile of a different NECTE speaker; the aim is to classify the speakers in terms of the similarity of their phonetic profiles or, put another way, in terms of the relative distances in the 158-dimensional phonetic space. The resulting tree is shown below.

 

 

Plotting M in 158-dimensional space would have been impossible, and, without cluster analysis, one would have been left pondering a very large and incomprehensible matrix of numbers. With the aid of cluster analysis, however, structure in the data is clearly visible.

 

We are now in a position to draw some inductive inferences:

-- The NECTE speakers fall into two main clusters of speakers based on their phonetic usage: NG1 and NG2.

-- The speakers of NG2 are not strongly differentiated in their phonetic usage.

-- The speakers of NG1 are strongly differentiated into two clusters NG1a and NG1b.

-- The speakers of NG1a and NG1b are strongly differentiated into two subclusters.

It is doubtful whether these inferences could have been made just by visual examination of the data matrix.

 

Step 3: Hypothesis formulation

Based on the above inferences, the hypothesis that answers the research question is:

 

There is systematic phonetic variation in the Tyneside speech community

 

This hypothesis can be developed and made more useful by analyzing the main clusters to determine the phonetic segments that are most important in differentiating the speakers. Warren Maguire and I have done this in a paper that is available via this link, and, on the basis of that paper, the hypothesis can be restated as:

 

There is systematic phonetic variation in the Tyneside speech community, and the main determinants of that variation are the phonetic segments ə (reduced), ɔː, ɪ, and eɪ

 

Step 4: Hypothesis testing

 

This hypothesis can be tested with respect to its validity for the whole Tyneside speech community by seeing if additional speakers fall into the same clusters, and moreover do so mainly on the basis of the same phonetic segments.

 

 


Reading

 

Deductive inference

 Inductive inference

 

Abductive inference

Scientific Method

Hypothesis

An excellent book on the nature of science and on scientific method is: A.F. Chalmers, What is this thing called science?, 3rd ed., Open University Press

The NECTE corpus is available at: http://www.ncl.ac.uk/necte/

The results of cluster analysis of phonetic data abstracted from the NECTE corpus together with discussions of a range of issues associated with such analysis are available online at my personal website: http://www.staff.ncl.ac.uk/hermann.moisl/research.htm

 


Finally

You are required to write a short diagnostic essay based on lectures 1-3 of this module; details of what's required are available here. For my part, I'm required to provide a choice of three topics in case anyone wants to write on the material I have covered. These are:

1. It is commonly said that deductive inference can produce no new knowledge from the premises. Do you think that this is true or false? In either case, explain your position.

2. Abductive inference makes some people uncomfortable. If it has that effect on you, explain why.

3. 'There are no proofs in science, only hypotheses'. Discuss the validity of this statement with reference to the varieties of inference presented above and to Popperian falsification.