Welcome to the spoken subcorpus of the Reference Corpus of Contemporary Portuguese (CRPC) at the Center of Linguistics of the University of Lisbon.
This corpus brings together several spoken corpora, each originally compiled for a specific project:
- C-ORAL-ROM (Spoken Corpus of Romance Languages)
- Português Fundamental
- Português Falado (Spoken Portuguese)
Duplicate files have been removed so that the result is a single corpus.
The CRPC-Oral is text-to-sound aligned.
C-ORAL-ROM and Português Fundamental are restricted to European Portuguese, while Português Falado covers all national varieties of Portuguese.
Transcription
Dysfluencies are visualized in the transcription and marked with XML elements as described in the table.
Phenomenon | Visualization in the transcription | XML element |
Repetitions | strikethrough | del reason="repetition" |
Truncated | strikethrough | del reason="truncated" |
Reformulations | strikethrough | del reason="reformulation" |
Reconstructed | different colour (e.g. vida) | supplied |
Short pause | / | pause type="short" |
Long pause | // | pause type="long" |
Filled pauses, extra-linguistic elements | ah | vocal desc |
Interrupted / abandoned segment | + | shift |
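As a rough illustration of how this markup could be processed outside the interface, the sketch below counts the markers in one utterance and produces a "fluent" reading with the struck-through material removed. Only the element and attribute names (del, pause, supplied, vocal, desc, shift) come from the table; the wrapping u element, the sample sentence, and the function names count_markup and fluent_text are assumptions made for the example, and the actual corpus files may be structured differently.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical utterance: tag names follow the table above, but the <u>
# wrapper and the wording are invented for illustration only.
SAMPLE = (
    '<u>eu <del reason="repetition">fui</del> fui <pause type="short"/> '
    'para a <del reason="truncated">pra</del> praia <pause type="long"/> '
    '<vocal><desc>ah</desc></vocal> e <supplied>depois</supplied> <shift/></u>'
)


def count_markup(xml_text):
    """Count the dysfluency, pause and related markers in one utterance."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for el in root.iter():
        if el.tag == "del":
            counts["del:" + el.get("reason", "unknown")] += 1
        elif el.tag == "pause":
            counts["pause:" + el.get("type", "unknown")] += 1
        elif el.tag in ("supplied", "vocal", "shift"):
            counts[el.tag] += 1
    return counts


def fluent_text(xml_text):
    """Return the words of the utterance without deleted material or filled pauses."""
    root = ET.fromstring(xml_text)
    parts = []

    def walk(el):
        if el.tag in ("del", "vocal"):
            return  # skip struck-through words and filled pauses
        if el.text:
            parts.append(el.text)
        for child in el:
            walk(child)
            if child.tail:
                parts.append(child.tail)  # text following the child element

    walk(root)
    return " ".join("".join(parts).split())  # normalise whitespace


if __name__ == "__main__":
    print(count_markup(SAMPLE))
    print(fluent_text(SAMPLE))
```

Keeping the text of supplied elements while dropping del and vocal is a choice made for this sketch; a study of reconstructed material might want to treat supplied differently.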
View Options
When opening a file, two options are available:
Search
Use the query builder to perform a query.
The result of the query is a list of contexts.
Clicking on a context opens the full transcription, where the full audio file can also be played.
By selecting “audio” in the View Options (at the top of the transcription), each line can be played separately.
Annotation
The corpus is lemmatized and tagged for part of speech (POS). The annotation can be queried from the Search menu.
See Tagset in the left menu.
Additional information is available on the project's webpages.