Emoji-Aware Sentiment Analysis

😭 reads emoji signals ⚖️ calibrates context 🔍 exposes sentiment

An interpretable ML pipeline for short-form social text. Combines TF-IDF token weights with hand-engineered emoji and lexicon polarity features to capture sentiment that text-only models miss.


Engine Output Decoder

Understanding the weights, markers, and logic thresholds driving the inference result.

🏷️

Classification

The binary sentiment target: Positive or Negative. This is the final decision synthesized from linguistic and symbolic features.

🎯

Confidence Score

A probability percentage indicating the model's mathematical certainty. Scores are influenced by the strength and consistency of the detected signal.

⚖️

Top 3 Drivers

The three features that most shifted the result. Each driver has a coefficient weight and a rank: Critical, Strong, or Moderate. Positive weights push toward Positive, negative weights push toward Negative.

📡

Signal Stability

High Ambiguity triggers when Confidence falls below 70%. It indicates conflicting linguistic markers or purely neutral text.


System Overview

Human emotion is increasingly expressed in fragments — brief messages, compressed language, and symbolic cues. This project investigates how machine learning can interpret that signal. By training a statistical model on labeled examples of short-form social media text, the system learns patterns associated with positive and negative sentiment and applies them to new, unseen input.

Unlike traditional approaches that treat emojis as noise, this system treats them as measurable affective data. Textual patterns are represented using TF-IDF statistics, capturing how strongly words correlate with sentiment across the dataset, while emoji polarity signals are extracted as structured features. These complementary channels are fused into a unified representation that allows the classifier to evaluate linguistic and symbolic meaning simultaneously.

The pipeline follows a rigorous machine learning lifecycle: deterministic data cleaning, exploratory analysis, feature engineering, supervised training, and performance evaluation. The resulting model and feature vocabulary are serialized to preserve parity between research and deployment.

When text is submitted through the interface, it undergoes the same transformations used during training. While the engine relies on probability-based predictions, it is governed by a hybrid logic layer. This architecture balances learned statistical patterns with deterministic rules designed to mitigate positive drift and interpret complex symbolic signals like sarcasm. The sections below document the empirical findings and architectural decisions that enable this inference process.


Data Scale & Engineering Constraints

Scale Note: At 1,000 labeled tweets, this is a low-resource setting by design — one that demands feature density over model complexity. Logistic Regression was chosen deliberately: its coefficient weights remain fully traceable, and on a corpus this size, a simpler model generalises better than a complex one.

Two raw datasets feed the pipeline. The primary corpus — 1k_data_emoji_tweets_senti_posneg.csv — contains 1,000 short-form social media posts with binary sentiment labels and embedded UTF-8 emojis. The second, 15_emoticon_data.csv, is a 16-entry Unicode reference table used for emoji coverage analysis. Neither dataset required imputation — both were structurally clean after index sanitisation and column standardisation.

Source Data Ingestion

Raw datasets prior to normalization and feature projection

Primary Stream / Sparse Matrix

text: string → 1,265 dimensions (TF-IDF)

Captures linguistic intent through n-gram tokenisation. Emojis are preserved in-place as UTF-8 characters and pass through the vectoriser unchanged.

UNIGRAMS BIGRAMS
Polarity Signal / Count Features

symbols + words → 4 dimensions (Polarity Counts)

Captures symbolic and lexical intent via four scalar counts: emoji_pos, emoji_neg, word_pos, word_neg — derived from curated polarity lexicons, not the reference table.

SCALAR LABEL-AGNOSTIC
Dataset 01 · Text Corpus (Cleaned)

Representative record excerpts demonstrating the alignment between textual signal and sentiment labels.

Index Sentiment Post (Raw)
0 1 "Good morning every one"
1 0 "TW: S AssaultActually horrified..."
2 1 "Thanks by has notice of me Greetings : Jossett..."
3 0 "its ending soon aah unhappy 😧"
4 1 "My real time happy 😊"
Dataset 02 · Emoji Reference Table (Cleaned)

16-entry Unicode metadata table mapping emoji symbols to codepoints and canonical names. Used for coverage analysis — contains no polarity scores.

Index Emoji Unicode Name
0 😍 SMILING FACE WITH HEART-SHAPED EYES
1 😭 LOUDLY CRYING FACE
2 😊 SMILING FACE WITH SMILING EYES
Dataset 03 · Final Feature Matrix (final dataset)

1,000-row feature matrix combining raw text with four engineered polarity count signals — the exact input consumed by the production classifier.

Index Label Text emoji_pos emoji_neg word_pos word_neg
0 1 "Good morning every one" 0 0 1 0
1 0 "TW: S AssaultActually horrified..." 0 10 1 1
2 1 "Thanks by has notice of me Greetings : Jossett..." 0 0 0 0
3 0 "its ending soon aah unhappy 😧" 0 10 0 1
4 1 "My real time happy 😊" 10 0 1 0
Technical Specification: Multi-Dimensional Polarity Vector

The extraction logic maps identified symbols and lexicon hits into a 4-dimensional anchor vector: [emoji_pos, emoji_neg, word_pos, word_neg]. Because the lexicons are stored as hash sets, each lookup is O(1), so extraction adds negligible runtime overhead. This vector is concatenated to the primary TF-IDF sparse matrix, acting as a deterministic "bias" that anchors the probabilistic weight of the word-based model.
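The extraction step described above can be sketched in a few lines. This is a minimal illustration, not the project's features.py: the emoji sets below are the six from NB 3.0, the word lists are hypothetical stand-ins for the curated lexicons, and the ×10 emoji boost mirrors the counts shown in the worked examples.

```python
# Illustrative sketch of the 4-dim polarity extraction. The word lexicons
# here are invented stand-ins; the real curated lists live in features.py.
POSITIVE_EMOJIS = {"😍", "😊", "😁", "😘"}
NEGATIVE_EMOJIS = {"😭", "😧"}
POSITIVE_WORDS = {"happy", "love", "great"}    # hypothetical subset
NEGATIVE_WORDS = {"sad", "unhappy", "worst"}   # hypothetical subset

def polarity_vector(text: str, boost: int = 10) -> list:
    """Return [emoji_pos, emoji_neg, word_pos, word_neg].

    Set membership gives O(1) lookups per character/token; emoji counts
    carry the x10 boost seen in the worked examples.
    """
    emoji_pos = sum(c in POSITIVE_EMOJIS for c in text) * boost
    emoji_neg = sum(c in NEGATIVE_EMOJIS for c in text) * boost
    tokens = text.lower().split()
    word_pos = sum(t in POSITIVE_WORDS for t in tokens)
    word_neg = sum(t in NEGATIVE_WORDS for t in tokens)
    return [emoji_pos, emoji_neg, word_pos, word_neg]

polarity_vector("My real time happy 😊")  # → [10, 0, 1, 0]
```

The output matches row 4 of the final feature matrix above: one boosted positive emoji, one positive word hit.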


Research & Development

01 Data Ingestion

The pipeline initiates with an investigative ingestion pass of two complementary data streams. Rather than a simple bulk load, this stage validates the structural integrity of the raw corpora — identifying redundant metadata and non-standard column conventions before establishing a definitive data contract for all downstream processing.

Supervised Corpus

1k_data_emoji_tweets_senti_posneg.csv

  • 1,000 labeled social text samples
  • Embedded UTF-8 emoji tokens preserved in-place
  • Binary sentiment targets (0 = negative, 1 = positive)

Emoji Reference Table

15_emoticon_data.csv

  • 16 entries — Unicode codepoints & canonical names
  • Structural metadata only — no polarity scores encoded
  • Used downstream as an emoji coverage reference

Raw Schema — Tweets

Unnamed: 0 dropped
post text
sentiment label

Raw Schema — Emoji Ref

Unnamed: 0 dropped
Emoji emoji
Unicode codepoint unicode_codepoint

Investigation Output

"Ingestion identified redundant Unnamed: 0 indices and non-standard post / sentiment column naming. These findings served as the direct requirements for the dataset.py standardisation script and established the read-only contract: all downstream notebooks consume from data/processed/ only."

02 Data Cleaning & Validation

Acting on the findings from the ingestion pass, this stage enforces a deterministic cleaning protocol with hard validation gates. Rather than aggressive text transformation, the focus is structural standardisation and schema assertion — ensuring emojis and raw linguistic cues remain intact and unmodified for all downstream analysis.

Structural Alignment

  • Standardised Schema text, label
  • Type Enforcement str, int
  • Null Removal dropna + reset_index
  • Ref Table snake_case emoji, unicode_*

Preservation Guardrails

  • Zero-loss emoji retention — no replacement or collapse
  • No semantic transforms — raw text preserved exactly
  • No modeling assumptions introduced at this stage

Hard Validation Assertions

  • assert Schema is exactly ["label", "text"]
  • assert All labels are strictly binary — label ∈ {0, 1}
  • assert No empty text fields — len(text) > 0 for all rows
  • assert All 16 emoji reference entries are unique — no duplicate emoji characters

Pipeline halts if any assertion fails — this stage is gated, not advisory.

Dataset Contract: Two canonical artifacts are written to data/processed/tweets_clean.csv (1,000 rows × 2 cols) and emoji_reference_clean.csv (16 rows × 3 cols). Both are treated as immutable, read-only inputs for all subsequent feature engineering and modelling passes. Raw data is never accessed again.
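A gated validation pass of this kind can be sketched with pandas. This is a hedged approximation of the assertions listed above, assuming the tweets_clean.csv schema from the contract; the project's actual gate may differ in detail.

```python
# Sketch of the hard validation gate, assuming the tweets_clean contract:
# exactly ["label", "text"], binary labels, no empty text.
import pandas as pd

def validate_tweets(df: pd.DataFrame) -> pd.DataFrame:
    """Gate, don't advise: raise AssertionError and halt on any violation."""
    assert sorted(df.columns) == ["label", "text"], "schema drift"
    assert df["label"].isin([0, 1]).all(), "non-binary label"
    assert (df["text"].str.len() > 0).all(), "empty text field"
    return df  # unchanged — the gate only observes

df = pd.DataFrame({
    "label": [1, 0],
    "text": ["Good morning every one", "its ending soon aah unhappy 😧"],
})
validate_tweets(df)  # passes silently; a violating frame raises
```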

03 Exploratory Analysis

Emoji Presence Rate

49.3% of Corpus

Exploratory analysis confirms that nearly half of all tweets contain at least one emoji — making emoji usage common, not incidental. This frequency alone justified proceeding with emoji-aware feature engineering rather than treating symbols as noise.

Label Distribution

Subtle Positive Lean

The corpus shows a natural skew toward positive sentiment (500 pos vs 436 neg). While stable, this 14.7% baseline lean informs the eventual need for deterministic guardrails.

Emoji Distribution

Bimodal & Sparse

~505 tweets carry zero emojis; ~465 carry exactly one. Fewer than 5% contain two or more — emojis act as binary anchors, not intensity counters.

Label Correlation

Positive Bias Anchor

Positive tweets use emojis at a significantly higher rate. As affective amplifiers, these symbols frequently outweigh textual weight, justifying a hybrid logic layer to mitigate positive drift.

Coverage Gap — The Critical Finding

34

Unique emojis
in corpus

16

Emojis in
reference table

12

Intersection
(overlap)

The reference table covers only 12 of the 34 unique emojis found in the corpus. This gap — confirmed here empirically — became the direct motivation for building an expanded polarity lexicon in the feature engineering stage.
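The coverage check itself is plain set arithmetic. The emoji literals below are illustrative stand-ins, not the actual 34-emoji corpus scan or the 16-entry reference table.

```python
# Toy reproduction of the coverage-gap computation; these sets are
# illustrative stand-ins for the real corpus scan and reference table.
corpus_emojis = {"😍", "😊", "😭", "😧", "😁", "😘", "💀", "🔥"}
reference_emojis = {"😍", "😊", "😭", "😁", "🙂"}

overlap = corpus_emojis & reference_emojis    # emojis the table can resolve
uncovered = corpus_emojis - reference_emojis  # the gap driving the expanded lexicon
len(overlap), len(uncovered)  # → (4, 4) on these toy sets
```

On the real data this yields the 12/34 intersection reported above, which motivated expanding the polarity lexicon.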

Account for Positive Lean · Orthogonal Signal · Low-Dimensional Features Justified · No Dense Representations

Design Decision

"EDA confirmed emoji-aware feature engineering is justified — but under strict constraints: simple, interpretable features only; no dense emoji representations. The subtle 14.7% positive class skew validates the use of a curated polarity lexicon paired with a deterministic logic layer to ensure reliable inference."

04 Emoji Sentiment Referencing

Before committing to any emoji feature design, the pipeline runs a controlled ablation — establishing a text-only baseline, then testing three emoji feature types incrementally. Only features that demonstrably improve on the baseline are retained.

Feature Ablation — Validation F1

Text Only (Baseline)

TF-IDF bigrams, 1,265 features

0.796

Reference

+ Emoji Presence

Binary has_emoji flag

0.775

−0.021 ✕ Rejected

+ Emoji Count

Raw emoji frequency per tweet

0.785

−0.011 ✕ Rejected

+ Emoji Polarity Counts

Separate pos / neg counts via curated lexicon

0.8235

+0.027 ✓ Accepted

Polarity Feature Construction NB 3.0 — §8

# Manually defined polarity sets (label-agnostic)
POSITIVE_EMOJIS = {"😍", "😊", "😁", "😘"}
NEGATIVE_EMOJIS = {"😭", "😧"}

# Count matches per polarity class
pos = sum(1 for c in text if c in POSITIVE_EMOJIS)
neg = sum(1 for c in text if c in NEGATIVE_EMOJIS)

# Stack with TF-IDF → final feature matrix
X_combined = hstack([X_tfidf, [[pos, neg]]])
#                             ↑ 1,265 dims + 2 dims = 1,267

Key Finding

"Feature semantics matter more than feature inclusion. Undirected emoji signals — presence or raw count — introduce more noise than signal. Only when direction is resolved (positive vs. negative) do emoji features improve on the text-only baseline. This distinction drives every decision downstream."

05 Feature Engineering

With the polarity mechanism validated in Step 4, this stage asks how far it can be pushed. Four independent signal sources are engineered, each justified by corpus evidence, and stacked into a single production feature matrix that features.py implements exactly.

1

Expanded Emoji Polarity

Lexicon expanded from 6 to 193 emojis — closing the 15.2% corpus coverage gap identified in EDA. Produces emoji_pos_count and emoji_neg_count.

0.834 F1

+0.038 vs baseline

2

Emoticon Fallback

Western-style emoticons such as :) and :( are invisible to character-level emoji extraction. Substring matching closes this structural gap.

0.838 F1

+0.042 vs baseline

3

Word Lexicon

TF-IDF may underweight low-frequency words. A deterministic lexicon injects stable priors as word_pos_count and word_neg_count.

0.840 F1

+0.044 vs baseline

4

Deterministic Sarcasm Veto

Final hybrid layer. If emoji_neg ≥ 10 and model prediction is Positive, the veto forces a Negative override to mitigate positive drift.

Hybrid Veto
Production Feature Matrix features.py

X_final = hstack([
    X_tfidf,      # 1,265 dims
    emoji_pol,    # 2 dims (×BOOST, emoticon fallback folded in)
    word_lex,     # 2 dims
])  # → 1,269 dims · sarcasm veto applied downstream as the hybrid logic layer
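The emoticon fallback from card 2 reduces to substring matching, since emoticons span multiple characters and cannot be caught by per-character emoji scans. A minimal sketch, assuming these common Western-style tokens (the project's actual emoticon list may differ):

```python
# Sketch of the emoticon fallback: substring matching, because multi-char
# tokens like ":)" are invisible to character-level emoji extraction.
POSITIVE_EMOTICONS = (":)", ":-)", ":D")
NEGATIVE_EMOTICONS = (":(", ":-(", ":'(")

def emoticon_counts(text: str):
    """Return (pos, neg) emoticon hit counts for one input."""
    pos = sum(text.count(e) for e in POSITIVE_EMOTICONS)
    neg = sum(text.count(e) for e in NEGATIVE_EMOTICONS)
    return pos, neg

emoticon_counts("this is the worst :)")  # → (1, 0)
```

Note this only supplies counts; as the validation section shows, emoticon-based sarcasm still bypasses the Unicode veto layer.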

06 Model Training & Evaluation

The production classifier is Logistic Regression — selected for interpretability and stability on a 1,000-sample dataset. Model selection was closed in NB 3.0; this stage trains on the finalised feature matrix, gates on a hard performance benchmark, and serialises the artifacts.

Validation Performance — 200 Held-Out Samples

0.8402

F1 Score ✓ Benchmark Met

82.5%

Accuracy

Negative Class

Precision 0.90
Recall 0.73

Positive Class

Precision 0.77
Recall 0.92

27 false positives vs 8 false negatives — the model is generous with positive predictions. The sarcasm veto in predict.py directly targets this failure mode.

Learned Feature Weights — Surprising Result

word_neg_count

Word lexicon — negative hits

−4.079

word_pos_count

Word lexicon — positive hits

+0.856

emoji_pos_count

Emoji polarity — positive

+0.148

emoji_neg_count

Emoji polarity — negative

−0.049

Architectural Implication

"The word lexicon is the single strongest feature in the model at −4.08, outweighing every TF-IDF unigram. Meanwhile, emoji coefficients are surprisingly weak (+0.15, −0.05) — the model treats word signal as more reliable than emoji signal. This is precisely why the sarcasm veto must exist as a deterministic rule rather than a learned weight: the model alone cannot resolve emoji-text conflict."

07 Inference Pipeline

The finalised system is encapsulated into a two-layer inference pipeline. Frozen artifacts guarantee that every live input undergoes the exact transformation sequence validated during training — followed by a deterministic veto layer that overrides the model when emoji-text conflict is detected.

1

Model Layer — Statistical Prediction

Frozen sentiment_model.pkl and tfidf_vectorizer.pkl reconstruct the 1,269-feature matrix at inference time. The model returns a probability estimate and a provisional label.

2

Veto Layer — Deterministic Override

If emoji_neg_count > 0 and the model predicted Positive, the veto fires — overriding to Negative. A word guard suppresses the veto when positive text overwhelms the emoji signal. This layer exists because the model's emoji coefficients are too weak (+0.15, −0.05) to resolve conflict reliably on their own.

Inference Blueprint predict.py

# 1. Load frozen artifacts
model = load("sentiment_model.pkl")
tfidf = load("tfidf_vectorizer.pkl")

# 2. Assemble 1,269-feature matrix
X = hstack([tfidf.transform(text), emoji_pol, word_lex])

# 3. Model prediction
proba = model.predict_proba(X)   # → [0.07, 0.93]

# 4. Veto check
if emoji_neg_count > 0 and label == "Positive":
    label = "Negative"   # unless word guard fires

# 5. Return 8-key response
return { prediction, confidence, veto_applied,
         entropy_flag, top_drivers, timestamp, … }

Veto Verification — End-to-End Validation

"i love having bugs 😭"

Sarcasm — veto fires

Negative ✓

conf 0.9318

"this is so beautiful and amazing 😭"

Pos-overwhelm — word guard suppresses veto

Positive ✓

conf 0.7478

"RIP grandma 😭"

Genuine sad — veto fires correctly

Negative ✓

conf 0.9318

"this is the worst :)"

Known limitation — emoticon sarcasm undetected

Positive ✗

conf 0.8479

Validation Status

"End-to-end validation confirms schema integrity across all prediction categories — 8 keys present in every response, audit trail consistent, veto logic verified on sarcasm and genuine-sad cases. One known limitation remains: emoticon-based sarcasm bypasses the veto since :) is invisible to the Unicode emoji extraction pipeline."


Under the Hood

The model is not a black box. Every prediction is the result of measurable arithmetic — token weights, polarity counts, and a deterministic veto check.

Algorithm Explainer

01. Extraction

Encoding raw text into a 1,269-dimension matrix.

Vector X = [ 1,265 TF-IDF tokens ] + [ e_pos, e_neg, w_pos, w_neg ]

02. Inference Gate

A deterministic "Sarcasm Veto" that overrides the model.

# Veto trigger condition
if e_neg >= 10 and prediction == POSITIVE:
    # Apply signal-gap confidence
    conf = 0.75 + (0.20 * (e_neg / total_signal))

SUPPRESSION GUARD: Veto is aborted if w_pos >= 2, deferring to statistical model.
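Putting the trigger, the suppression guard, and the signal-gap confidence together, the gate can be sketched as a single pure function. The thresholds are the ones stated above (e_neg ≥ 10 trigger, w_pos ≥ 2 guard); the function shape is illustrative, not the exact predict.py signature.

```python
# Sketch of the inference gate: veto trigger, sincerity guard, and
# signal-gap confidence, using the thresholds from the logic map.
def apply_veto(prediction, e_neg, w_pos, total_signal):
    """Return (label, confidence); confidence is None when the model's
    own probability should be kept."""
    if prediction == "Positive" and e_neg >= 10:
        if w_pos >= 2:                 # suppression guard: defer to the model
            return prediction, None
        conf = 0.75 + 0.20 * (e_neg / total_signal)
        return "Negative", round(conf, 4)
    return prediction, None

# Worked example 02: "i love having bugs 😭" → e_neg=10, w_pos=1, total=11
apply_veto("Positive", e_neg=10, w_pos=1, total_signal=11)  # → ("Negative", 0.9318)
```

With w_pos raised to 2 (worked example 04) the guard fires and the model's positive call survives untouched.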

03. Statistical Model

Arithmetic logic used only if Gatekeeper allows.

Dot Product (Logit)

z = Σ(wᵢ · xᵢ) + b

Sum of feature-weight products.

Sigmoid Function

σ(z) = 1 / (1 + e^(−z))

Squashes logit into range.
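The two formulas above are all the arithmetic the model layer needs. A stdlib-only check, reproducing the sigmoid step with the logits quoted in the worked examples further down:

```python
# Verifying the sigmoid step against the logits from the worked examples.
import math

def sigmoid(z):
    """Squash a logit into the (0, 1) probability range."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (2.9657, 0.5019, 1.0867):
    print(f"z = {z:+.4f} → {sigmoid(z):.2%}")
# z = +2.9657 → 95.10%
# z = +0.5019 → 62.29%
# z = +1.0867 → 74.78%
```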

04. Calibration

Finalizing output status based on probability.

Clear Signal

Conf > 0.70

Ambiguous

Conf < 0.70

Veto Applied

0.75 + (0.20 * Gap)

Scaled Confidence

Worked Examples
01
"My real time happy 😊" Clear Positive

Feature Extraction

emoji_pos_count 10 😊 × 10
word_pos_count 1 "happy"

Inference Trace

"happy" weight +1.6744
word_pos_count +0.8915
Logit (z) → Sigmoid
+2.9657 95.10%
Final Output
POSITIVE | 95.10% Confidence
02
"i love having bugs 😭" Veto Applied

Deterministic Veto Triggered

Statistical Model (Sigmoid) +0.3852 Logit (Positive)

Conflict: Input contains emoji_neg ≥ 10 (😭) while model predicts positive. Positive Drift detected—initiating deterministic override.

Signal Gap Calculation

0.75 + (0.20 * 0.909) → 93.18% Negative Certainty

Final Output
NEGATIVE | 93.18% Confidence | Deterministic Anchor
03
"okay I guess" High Ambiguity

Low Signal Inference

No Polarity Lexicon Hits Detected
"guess" (TF-IDF) +0.2091
"okay" (TF-IDF) +0.1441
Probabilistic Result Uncertainty Flagged
+0.5019 Logit → sigmoid → 62.29%
Final Output
POSITIVE | 62.29% Confidence | Below 0.70 Threshold
04
"This is so beautiful and amazing 😭" VETO ABORTED

Guard Condition Check

emoji_neg: 10 (😭 Triggered)

word_pos: 2 (Guard Met: "beautiful", "amazing")

GATE STATUS: OPEN
Veto aborted. User intent interpreted as sincere sentiment. Proceeding to statistical inference.

Statistical Model Trace

"so" weight-1.0291
word_pos_count+0.8915
"beautiful"+0.1270
model_bias (intercept)+1.7087
Logit (z) → Sigmoid
+1.0867 74.78%
Final Output
POSITIVE | 74.78% Confidence
05
"This is so beautiful 😭" Known Blind Spot

Engineering Trade-off: This is a heuristic casualty. While a human might read this as "crying from happiness," the system treats it as sarcasm because the word_pos count fails to hit the "Sincerity Guard" threshold of 2. This blind spot is the accepted cost of suppressing the much larger class of sarcastic false positives driven by the model's inherent 14.7% positive drift.

Veto Strike: Statistical Bypass

Signal Status

word_pos: 1 (Insufficient Guard)

e_neg: 10 (Veto Triggered)

Model Result

Positive (Discarded)

Deterministic Confidence Calculation

# Anchoring at 93.18% to override "Positive Drift"

0.75 + (0.20 * 0.9091) = 0.9318

Final Output
NEGATIVE | 93.18% Confidence


You've seen the math.
Now run the inference.

Every coefficient, every veto condition, every polarity count — all of it executes in under 15ms on your next input.

Execute Classification