Corpora of PLE

Data collection for learning Portuguese as a second language

Methodology
Data were collected from participants following pre-defined guidelines, which comprised (i) a language profile questionnaire, to which a number was assigned; and (ii) the text, identified with the respective informant's number. This coding system allowed us to identify texts written by the same informant, who only had to fill out the language profile questionnaire once. Once collected, the materials were transcribed, coded and organized.

1. Data from informants

First, we created an Excel file bringing together the participants' sociolinguistic data. The information, organized by the university where the collection took place, includes the following items, each of which can be searched individually through the filter system:

a) personal data – age, sex, nationality, course/year of their course and university where the course is attended;

b) language background – mother tongue (LM); language of schooling; other languages known besides Portuguese; language in which the student is most proficient besides the LM (levels from the Common European Framework of Reference for Languages (CEFR));

c) Portuguese language – year in which the study of Portuguese began; other courses on Portuguese culture; contact with other Portuguese speakers; proficiency in Portuguese (CEFR levels);

d) stimuli (see point 2 of the Methodology)
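
As an illustration only, the informant table and its filter system could be mirrored in code roughly as follows. The field names, the `filter_informants` helper and the sample records are all hypothetical; the actual layout of the Excel file is not described here beyond the items listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Informant:
    """One row of the informant table (field names are assumptions)."""
    number: int                 # number assigned on the language profile questionnaire
    university: str             # code of the university where the text was collected
    age: int
    sex: str
    nationality: str
    mother_tongue: str          # LM
    portuguese_level: str       # CEFR level, e.g. "A1-A2"
    stimuli: list[str] = field(default_factory=list)


def filter_informants(informants, **criteria):
    """Return the informants whose fields equal every given criterion."""
    return [
        person
        for person in informants
        if all(getattr(person, name) == value for name, value in criteria.items())
    ]


# Invented sample records, for demonstration only.
sample = [
    Informant(7, "RU", 20, "F", "American", "English", "A1-A2", ["45.2L"]),
    Informant(8, "RU", 23, "M", "American", "Spanish", "B1-B2", ["45.2L"]),
]
beginners = filter_informants(sample, university="RU", portuguese_level="A1-A2")
```

Each keyword argument plays the role of one column filter in the spreadsheet, so combining several arguments narrows the selection in the same way as stacking filters.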

2. Written productions

Each written production was elicited by a stimulus. Participating teachers were given a list of 83 writing proposals (revised and expanded from those designed for the PhD of the coordinator of this project, Isabel Leiria), organized into three thematic sections:
1. The individual
2. Society
3. The environment

At the beginning of the project, we asked the teachers to select one or two stimuli from each of the three themes, according to the students' learning level and personal preference. The selected stimuli and the number of productions obtained for each one can be consulted here.

Note: The materials provided by the University of Pusan, identified with the code PU, were not collected according to the guidelines of the present project and consist of texts from exams. Sociolinguistic data on the informants are therefore not available in this case. To respect the identification system, each of these texts was assigned the number of the stimulus that best fit the task performed.

3. Transcript standards

The texts were transcribed according to the following conventions (cf. Leiria, I. 2006 - Léxico, aquisição e ensino do Português Europeu língua não materna. Lisboa: FCG/FCT, p. 201):

< XXX > crossed-out segments
<(...)> illegible crossed-out segments
/ xxx / added segments
/ * xxx / conjectured readings

To conceal names and other elements that could reveal the identity of the informant, such elements were replaced by the code XXXX. This notation is also in line with the protocol of the PL2 Corpora Collection at the University of Coimbra.
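
For readers who want to process the transcripts automatically, the sketch below shows one possible way to derive the informant's final version of a text from the four markers listed above. It is not part of the project's tooling. The function name `final_reading` and the sample sentence are invented, and the regular expressions assume that `<`, `>` and `/` occur in the transcripts only as markers (a date written as 12/05 would be misread).

```python
import re

# Crossed-out segment that could not be read: <(...)>
ILLEGIBLE = re.compile(r"<\s*\(\.\.\.\)\s*>")
# Crossed-out segment: < XXX >
DELETED = re.compile(r"<\s*[^<>]*?\s*>")
# Conjectured reading: / * xxx /
CONJECTURE = re.compile(r"/\s*\*\s*([^/]*?)\s*/")
# Added segment: / xxx /
ADDED = re.compile(r"/\s*([^/*][^/]*?)\s*/")


def final_reading(transcript: str) -> str:
    """Drop crossed-out material; keep additions and conjectured readings."""
    text = ILLEGIBLE.sub("", transcript)
    text = DELETED.sub("", text)
    # Conjectures are unwrapped before additions, because both use slashes.
    text = CONJECTURE.sub(r"\1", text)
    text = ADDED.sub(r"\1", text)
    # Collapse the extra spaces left behind by removed markers.
    return re.sub(r"\s+", " ", text).strip()


# Invented example sentence, not taken from the corpus.
cleaned = final_reading("Eu <gosto> /adoro/ de /* viajar/ <(...)> muito")
```

The anonymization code XXXX is ordinary text under this scheme and passes through unchanged.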

4. Encoding of the collected texts

Each document is labeled with (i) the university where the collection was made; (ii) the level of proficiency in Portuguese at the time of collection (codes 1, 2 and 3 correspond, respectively, to CEFR levels A1-A2, B1-B2 and C1-C2); (iii) the informant's number (assigned on the language profile questionnaire); and (iv) the code of the stimulus (matching the codes in the list of stimuli given to teachers).

Thus, a text written at Rutgers University (RU) by a student at level A1-A2 (1), with identification number 07, in response to stimulus 45.2L, has the following identification: RU_1_07_45.2L.
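
The labeling scheme can be sketched in code as follows. This is a minimal illustration that assumes university codes are written in capital letters and informant numbers are zero-padded to two digits, as in the example above. The names `TextId`, `parse_text_id` and `build_text_id` are hypothetical and do not come from the project.

```python
import re
from typing import NamedTuple

# Proficiency codes and the CEFR levels they stand for.
LEVELS = {"1": "A1-A2", "2": "B1-B2", "3": "C1-C2"}

# Assumed shape: UNIVERSITY_LEVEL_INFORMANT_STIMULUS
ID_PATTERN = re.compile(r"^([A-Z]+)_([123])_(\d+)_(.+)$")


class TextId(NamedTuple):
    university: str   # e.g. "RU" for Rutgers University
    level: str        # CEFR band, e.g. "A1-A2"
    informant: str    # kept as a string to preserve leading zeros
    stimulus: str     # stimulus code, e.g. "45.2L"


def parse_text_id(label: str) -> TextId:
    """Split a document label into its four components."""
    match = ID_PATTERN.match(label)
    if match is None:
        raise ValueError(f"unrecognised identifier: {label!r}")
    university, level_code, informant, stimulus = match.groups()
    return TextId(university, LEVELS[level_code], informant, stimulus)


def build_text_id(university: str, level_code: int, informant: int, stimulus: str) -> str:
    """Assemble a label; two-digit padding is an assumption based on '07'."""
    return f"{university}_{level_code}_{informant:02d}_{stimulus}"


parsed = parse_text_id("RU_1_07_45.2L")
rebuilt = build_text_id("RU", 1, 7, "45.2L")
```

Here `parsed` holds the four components of the example label, with the level code already translated into its CEFR band, and `rebuilt` reproduces the original label from those components.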