
CRPC-Oral: Spoken Subpart of the Reference Corpus of Contemporary Portuguese

Welcome to the spoken subcorpus of the Reference Corpus of Contemporary Portuguese (CRPC) at the Center of Linguistics of the University of Lisbon.

This corpus brings together several spoken corpora, each originally compiled for a specific project:

- C-ORAL-ROM (Spoken Corpus of Romance Languages)

- Português Fundamental  

- Português Falado (Spoken Portuguese)

Duplicate files have been removed to create a single corpus.

The CRPC-Oral is text-to-sound aligned.

C-ORAL-ROM and Português Fundamental are restricted to European Portuguese, while Português Falado covers all national varieties of Portuguese.

 

Transcription

Dysfluencies are visualized in the transcription and marked with XML elements as described in the table.

 

 Phenomenon                                 Visualization in the transcription   XML element
 Repetitions                                strikethrough                        del reason="repetition"
 Truncated words                            strikethrough                        del reason="truncated"
 Reformulations                             strikethrough                        del reason="reformulation"
 Reconstructed                              different colour (ex: vida)          supplied
 Short pause                                /                                    pause type="short"
 Long pause                                 //                                   pause type="long"
 Filled pauses, extra-linguistic elements   ah, eh, hhh                          vocal, desc
 Interrupted / abandoned segment            +                                    shift

 

View Options

When opening a file, two options are available:

  • “transcription”: the full transcription is shown, with repetitions, reformulations and truncated words struck through
  • “Dialogue form”: repetitions, reformulations and truncated words are not visible
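
The difference between the two views can be illustrated with a minimal Python sketch. It is not the site's actual implementation: the `<u>` wrapper element, the sample utterance, and the `render` helper are assumptions made for illustration, and only the `del` and `pause` elements and their attributes are taken from the table above. Square brackets stand in for strikethrough, which plain text cannot show.

```python
import xml.etree.ElementTree as ET

# Hypothetical utterance. The <u> wrapper and the words are invented for
# illustration; only <del reason="..."> and <pause type="..."> come from
# the transcription table.
SAMPLE = (
    '<u>eu <del reason="repetition">fui</del> fui <pause type="short"/> '
    '<del reason="truncated">à esc</del> à escola <pause type="long"/></u>'
)

# Disfluencies that the "Dialogue form" view hides.
HIDDEN_REASONS = {"repetition", "reformulation", "truncated"}
# Pause symbols as shown in the transcription.
PAUSE_MARKS = {"short": "/", "long": "//"}


def render(elem, dialogue_form=False):
    """Return the text of an utterance for one of the two views."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "del":
            hidden = dialogue_form and child.get("reason") in HIDDEN_REASONS
            if not hidden:
                # Brackets are a plain-text stand-in for strikethrough.
                parts.append("[" + render(child, dialogue_form) + "]")
        elif child.tag == "pause":
            parts.append(PAUSE_MARKS.get(child.get("type"), ""))
        else:
            parts.append(render(child, dialogue_form))
        # Text following the child element belongs to the utterance.
        parts.append(child.tail or "")
    # Collapse the whitespace left behind by removed elements.
    return " ".join("".join(parts).split())


root = ET.fromstring(SAMPLE)
print(render(root))                      # eu [fui] fui / [à esc] à escola //
print(render(root, dialogue_form=True))  # eu fui / à escola //
```

Whether pauses remain visible in the dialogue form is not stated here, so the sketch keeps them in both views.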

 

Search 

Use the query builder to perform a query. 

  • Queries can be restricted by year / country / project / channel / title.

The result of the query is a list of contexts.

Clicking on a context opens the full transcription, and the full audio file can be played.

Selecting “audio” in the View Options (at the top of the transcription) lets you listen to each line separately.

 

Annotation

The corpus is lemmatized and tagged with part-of-speech (POS) tags. The annotation can be queried from the Search menu.

See Tagset in the left menu.

 

Additional information is available on the projects' webpages:

C-ORAL-ROM

Português Fundamental

Português Falado