library(tidyverse)
library(easystats)
data("germeval_train", package = "pradadata")
germeval01-textfeatures
2023
textmining
datawrangling
germeval
string
Exercise
Extract typical text features from a text.
Use the textfeatures package.
Use the GermEval-2018 data.
The data are licensed under CC-BY-4.0. Author: Wiegand, Michael (Spoken Language Systems, Saarland University (2010-2018), Leibniz Institute for the German Language (since 2019)).
The data are also available via the R package pradadata.
Try this small text dataset before you use the larger germeval dataset:
Data
Test string:
text <- c("Abbau, Abbruch ist jetzt",
          "Test 🧑🎓 😄 heute!!",
          "Abbruch #morgen #perfekt",
          "Abmachung... LORE IPSUM",
          "boese ja", "böse nein",
          "hallo ?! www.google.de",
          "gut schlecht I am you are he she it is")

n_emo <- c(2, 0, 2, 1, 1, 1, 0, 2)

test_text <-
  data.frame(id = 1:length(text),
             text = text,
             n_emo = n_emo)

test_text
id text n_emo
1 1 Abbau, Abbruch ist jetzt 2
2 2 Test 🧑🎓 😄 heute!! 0
3 3 Abbruch #morgen #perfekt 2
[ reached 'max' / getOption("max.print") -- omitted 5 rows ]
Notes:
- Otherwise, follow the general notes of the Datenwerk.
Solution
The textfeatures package is currently not on CRAN, but it is available from GitHub (or from the CRAN archive).
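One possible way to install the package from GitHub is sketched below. The repository name `mkearney/textfeatures`, the use of the `remotes` package, and the archive version number in the comment are assumptions that the exercise text does not state, so check them before running the code.

```r
# Install textfeatures from GitHub if it is not yet available.
# Assumption: the repository lives at "mkearney/textfeatures".
if (!requireNamespace("textfeatures", quietly = TRUE)) {
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  remotes::install_github("mkearney/textfeatures")
  # Alternative (assumed version number): install from the CRAN archive
  # remotes::install_version("textfeatures", version = "0.3.3")
}
```

The `requireNamespace()` guard keeps the snippet from reinstalling the package on every run.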
library(tidyverse)
library(tictoc)
library(textfeatures)
Test 1
Here is a test from the package author:
x <- c(
  "this is A!\t sEntence https://github.com about #rstats @github",
  "and another sentence here", "THe following list:\n- one\n- two\n- three\nOkay!?!"
)

## get text features
textfeatures::textfeatures(x, verbose = FALSE)
# A tibble: 3 × 36
n_urls n_uq_urls n_hashtags n_uq_hashtags n_mentions n_uq_mentions n_chars
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.15 1.15 1.15 1.15 1.15 1.15 0.243
2 -0.577 -0.577 -0.577 -0.577 -0.577 -0.577 -1.10
3 -0.577 -0.577 -0.577 -0.577 -0.577 -0.577 0.856
# ℹ 29 more variables: n_uq_chars <dbl>, n_commas <dbl>, n_digits <dbl>,
# n_exclaims <dbl>, n_extraspaces <dbl>, n_lowers <dbl>, n_lowersp <dbl>,
# n_periods <dbl>, n_words <dbl>, n_uq_words <dbl>, n_caps <dbl>,
# n_nonasciis <dbl>, n_puncts <dbl>, n_capsp <dbl>, n_charsperword <dbl>,
# sent_afinn <dbl>, sent_bing <dbl>, sent_syuzhet <dbl>, sent_vader <dbl>,
# n_polite <dbl>, n_first_person <dbl>, n_first_personp <dbl>,
# n_second_person <dbl>, n_second_personp <dbl>, n_third_person <dbl>, …
Test 2
textfeatures::textfeatures(test_text$text,
                           sentiment = FALSE,
                           word_dims = FALSE)
↪ Counting features in text...
↪ Parts of speech...
↪ Word dimensions started
↪ Normalizing data
✔ Job's done!
# A tibble: 8 × 29
n_urls n_uq_urls n_hashtags n_uq_hashtags n_mentions n_uq_mentions n_chars
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 -0.354 -0.354 0 0 0.532
2 0 0 -0.354 -0.354 0 0 0.0800
3 0 0 2.47 2.47 0 0 0.589
4 0 0 -0.354 -0.354 0 0 0.532
5 0 0 -0.354 -0.354 0 0 -1.86
6 0 0 -0.354 -0.354 0 0 -1.25
7 0 0 -0.354 -0.354 0 0 0.471
8 0 0 -0.354 -0.354 0 0 0.910
# ℹ 22 more variables: n_uq_chars <dbl>, n_commas <dbl>, n_digits <dbl>,
# n_exclaims <dbl>, n_extraspaces <dbl>, n_lowers <dbl>, n_lowersp <dbl>,
# n_periods <dbl>, n_words <dbl>, n_uq_words <dbl>, n_caps <dbl>,
# n_nonasciis <dbl>, n_puncts <dbl>, n_capsp <dbl>, n_charsperword <dbl>,
# n_first_person <dbl>, n_first_personp <dbl>, n_second_person <dbl>,
# n_second_personp <dbl>, n_third_person <dbl>, n_tobe <dbl>,
# n_prepositions <dbl>
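The values in the output are not raw counts: the log line "Normalizing data" and entries such as -0.354 and 2.47 in `n_hashtags` suggest that the features were standardized. The following base-R sketch shows that these two values are consistent with z-standardizing the hashtag counts. The raw counts are an assumption read off the test strings, and the package may apply further transformations before standardizing; for a column with only two distinct values, that would not change the result.

```r
# Raw hashtag counts for the eight test strings (assumption: read off by eye;
# only the third string, "Abbruch #morgen #perfekt", contains hashtags).
n_hashtags_raw <- c(0, 0, 2, 0, 0, 0, 0, 0)

# z-standardize: subtract the mean, divide by the sample standard deviation
z_manual <- (n_hashtags_raw - mean(n_hashtags_raw)) / sd(n_hashtags_raw)

# base R's scale() performs the same computation
z_scale <- as.numeric(scale(n_hashtags_raw))

round(z_manual, 3)
```

Assuming the installed version of textfeatures exposes the `normalize` argument, setting `normalize = FALSE` should return the unstandardized features, which are easier to compare with a hand-coded column such as `n_emo`.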