emojis1

emoji

textmining

string

Published

November 16, 2023

Exercise

Count the number of emojis in a text.

Use the GermEval-2018 data.

The data are licensed under CC-BY-4.0. Author: Wiegand, Michael (Spoken Language Systems, Saarland University (2010-2018), Leibniz Institute for the German Language (since 2019)).

The data are also available via the R package pradadata.

library(tidyverse)
library(easystats)
data("germeval_train", package = "pradadata")

Try your approach on this small text dataset before using the larger germeval dataset:

Data

Test string:

text <- c("Abbau, Abbruch ist jetzt", 
          "Test   🧑‍🎓 😄 heute!!", 
          "Abbruch #morgen #perfekt", 
          "Abmachung... LORE IPSUM", 
          "boese ja", "böse nein", 
          "hallo ?! www.google.de", 
          "gut schlecht I am you are he she it is")

n_emo <- c(2, 0, 2, 1, 1, 1, 0, 2)

test_text <-
  data.frame(id = seq_along(text),
             text = text,
             n_emo = n_emo)

test_text

  id                     text n_emo
1  1 Abbau, Abbruch ist jetzt     2
2  2   Test   🧑‍🎓 😄 heute!!     0
3  3 Abbruch #morgen #perfekt     2
 [ reached 'max' / getOption("max.print") -- omitted 5 rows ]

Notes:

Otherwise, follow the general guidelines of the Datenwerk.

Solution

Setup

library(tidyverse)
library(tictoc)
library(emoji)  # emoji_xxx
library(tidyEmoji)

Test 1

test_text |> 
  mutate(n_emojis = emoji_count(text))

  id                     text n_emo n_emojis
1  1 Abbau, Abbruch ist jetzt     2        0
2  2   Test   🧑‍🎓 😄 heute!!     0        3
 [ reached 'max' / getOption("max.print") -- omitted 6 rows ]
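Row 2 visually contains two emojis, yet the count is 3. A likely explanation, stated here as an assumption about how the count is computed, is that the student emoji is a zero-width-joiner (ZWJ) sequence built from two emoji code points, so its components are counted separately. The following base-R sketch only inspects the code points of that sequence; the variable names `student` and `code_points` are illustrative.

```r
# The student emoji as a ZWJ sequence: person + zero-width joiner + graduation cap
student <- "\U0001F9D1\u200D\U0001F393"

# Decode the UTF-8 string into its Unicode code points
code_points <- utf8ToInt(student)

# Show the code points in the usual U+XXXX notation
code_point_labels <- sprintf("U+%04X", code_points)
print(code_point_labels)

# One visible glyph, but three code points, two of which are emojis themselves
length(code_points)
```

Together with the smiley, that gives three emoji code points in the string, which matches the reported value.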

The emoji package contains a large number of emojis:

emoji_name |> length()

[1] 4538

Test 2

test_text$text |> 
  emoji_subset()

[1] "Test   🧑‍🎓 😄 heute!!"

tidyEmoji: categorizing emojis

data.frame(tweets = c("I love tidyverse \U0001f600\U0001f603\U0001f603",
"R is my language! \U0001f601\U0001f606\U0001f605",
"This Tweet does not have Emoji!",
"Wearing a mask\U0001f637\U0001f637\U0001f637.",
"Emoji does not appear in all Tweets",
"A flag \U0001f600\U0001f3c1")) %>%
emoji_categorize(tweets)

# A tibble: 4 × 2
  tweets                   .emoji_category        
  <chr>                    <chr>                  
1 I love tidyverse 😀😃😃  Smileys & Emotion      
2 R is my language! 😁😆😅 Smileys & Emotion      
3 Wearing a mask😷😷😷.    Smileys & Emotion      
4 A flag 😀🏁              Smileys & Emotion|Flags

test_text |> 
  emoji_categorize(text)

# A tibble: 1 × 4
     id text                 n_emo .emoji_category                        
  <int> <chr>                <dbl> <chr>                                  
1     2 Test   🧑‍🎓 😄 heute!!     0 Smileys & Emotion|People & Body|Objects

data(wild_emojis, package = "pradadata")

wild_emojis |> 
  emoji_categorize(emoji)

# A tibble: 28 × 2
   emoji .emoji_category                           
   <chr> <chr>                                     
 1 💣    Smileys & Emotion|NULL|NULL|NULL|NULL|NULL
 2 💀    Smileys & Emotion|NULL|NULL|NULL|NULL|NULL
 3 ☠️     Smileys & Emotion|NULL|NULL|NULL|NULL|NULL
 4 😠    Smileys & Emotion|NULL|NULL|NULL|NULL|NULL
 5 👹    Smileys & Emotion|NULL|NULL|NULL|NULL|NULL
 6 💩    Smileys & Emotion|NULL|NULL|NULL|NULL|NULL
 7 😡    Smileys & Emotion|NULL|NULL|NULL|NULL|NULL
 8 🤢    Smileys & Emotion|NULL|NULL|NULL|NULL|NULL
 9 🤮    Smileys & Emotion|NULL|NULL|NULL|NULL|NULL
10 😖    Smileys & Emotion|NULL|NULL|NULL|NULL|NULL
# ℹ 18 more rows

Alternatively, emojis can be matched with a regular expression based on Unicode properties, for example emoji_pattern <- "\\p{So}".

That is probably the smarter approach 🤓.
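A minimal sketch of the regex approach, assuming the stringr and dplyr packages are available. The data frame `demo` and the column `n_emojis` are made up for illustration. Note that `\p{So}` is the Unicode general category "Symbol, other", which is only an approximation of "emoji": it also matches non-emoji symbols such as the copyright sign.

```r
library(dplyr)
library(stringr)

# Unicode general category "So" (Symbol, other): an approximation, not an exact emoji class
emoji_pattern <- "\\p{So}"

# Hypothetical demo data; the first string mirrors row 2 of the test data
demo <- data.frame(
  id = 1:3,
  text = c("Test   \U0001F9D1\u200D\U0001F393 \U0001F604 heute!!",
           "hallo ?! www.google.de",
           "Copyright \u00A9 2023")
)

# Count pattern matches per row
demo_counted <- demo |>
  mutate(n_emojis = str_count(text, emoji_pattern))

demo_counted
```

The same `mutate()` call should carry over to the larger dataset, assuming it stores the tweets in a column named `text`. If symbols like the copyright sign must not be counted, a dedicated emoji list (as in the emoji package) is the safer choice.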
