Research & Development
01
Data Ingestion
The pipeline initiates with an investigative ingestion pass of two complementary data streams. Rather than a simple bulk load, this stage validates the structural integrity of the raw corpora — identifying redundant metadata and non-standard column conventions before establishing a definitive data contract for all downstream processing.
Supervised Corpus
1k_data_emoji_tweets_senti_posneg.csv
- 1,000 labeled social text samples
- Embedded UTF-8 emoji tokens preserved in-place
- Binary sentiment targets (0 = negative, 1 = positive)
Emoji Reference Table
15_emoticon_data.csv
- 16 entries — Unicode codepoints & canonical names
- Structural metadata only — no polarity scores encoded
- Used downstream as an emoji coverage reference
Raw Schema — Tweets
Raw Schema — Emoji Ref
Investigation Output
"Ingestion identified redundant Unnamed: 0 indices
and non-standard post /
sentiment column naming.
These findings served as the direct requirements for the
dataset.py standardisation script
and established the read-only contract: all downstream notebooks consume
from data/processed/ only."
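The findings above imply a small standardisation pass. A minimal sketch, assuming pandas as the tooling and the raw column names reported by the investigation (post, sentiment); the actual dataset.py may differ:

```python
import pandas as pd

def standardise(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the dataset.py standardisation pass."""
    out = raw.drop(columns=["Unnamed: 0"], errors="ignore")           # redundant index
    out = out.rename(columns={"post": "text", "sentiment": "label"})  # non-standard names
    return out.dropna().reset_index(drop=True)

# Stand-in frame mirroring the reported raw schema
raw = pd.DataFrame({"Unnamed: 0": [0, 1],
                    "post": ["love this 😍", "so tired 😭"],
                    "sentiment": [1, 0]})
clean = standardise(raw)
print(list(clean.columns))  # → ['text', 'label']
```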
02
Data Cleaning & Validation
Acting on the findings from the ingestion pass, this stage enforces a deterministic cleaning protocol with hard validation gates. Rather than aggressive text transformation, the focus is structural standardisation and schema assertion — ensuring emojis and raw linguistic cues remain intact and unmodified for all downstream analysis.
Structural Alignment
- Standardised schema: text, label
- Type enforcement: str, int
- Null removal: dropna + reset_index
- Reference table: snake_case emoji, unicode_* columns
Preservation Guardrails
- ✔ Zero-loss emoji retention — no replacement or collapse
- ✔ No semantic transforms — raw text preserved exactly
- ✔ No modeling assumptions introduced at this stage
Hard Validation Assertions
- assert: schema is exactly ["label", "text"]
- assert: all labels are strictly binary — label ∈ {0, 1}
- assert: no empty text fields — len(text) > 0 for all rows
- assert: all 16 emoji reference entries are unique — no duplicate emoji characters
Pipeline halts if any assertion fails — this stage is gated, not advisory.
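A sketch of how these gates might be expressed, assuming pandas frames named tweets and emoji_ref and the column names from the contract; the real validation code may be organised differently:

```python
import pandas as pd

def validate(tweets: pd.DataFrame, emoji_ref: pd.DataFrame) -> None:
    """Sketch of the hard validation gate; raises AssertionError to halt the pipeline."""
    assert sorted(tweets.columns) == ["label", "text"], "schema drift"
    assert set(tweets["label"].unique()) <= {0, 1}, "non-binary label"
    assert (tweets["text"].str.len() > 0).all(), "empty text field"
    assert emoji_ref["emoji"].is_unique, "duplicate emoji reference entry"

# Passes silently on conforming inputs; any violation halts the run
validate(pd.DataFrame({"text": ["hi 😊"], "label": [1]}),
         pd.DataFrame({"emoji": ["😊", "😭"]}))
```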
Dataset Contract: Two canonical artifacts are written to
data/processed/ —
tweets_clean.csv (1,000 rows × 2 cols) and
emoji_reference_clean.csv (16 rows × 3 cols).
Both are treated as immutable, read-only inputs for all subsequent
feature engineering and modelling passes. Raw data is never accessed again.
03
Exploratory Analysis
Emoji Presence Rate
49.3% of Corpus
Exploratory analysis confirms that nearly half of all tweets contain at least one emoji — making emoji usage common, not incidental. This frequency alone justified proceeding with emoji-aware feature engineering rather than treating symbols as noise.
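The presence rate is a one-line aggregate once emoji detection is defined. A sketch using a Unicode-category heuristic (category "So", Symbol-other, which catches the common emoji ranges); the notebook's actual detector may differ:

```python
import unicodedata

def has_emoji(text: str) -> bool:
    # Heuristic: any character in Unicode category "So" (Symbol, other)
    return any(unicodedata.category(c) == "So" for c in text)

tweets = ["love this 😍", "plain text", "so sad 😭"]
rate = sum(has_emoji(t) for t in tweets) / len(tweets)
print(f"{rate:.1%}")  # → 66.7%
```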
Label Distribution
Subtle Positive Lean
The corpus shows a natural skew toward positive sentiment (500 pos vs 436 neg). While stable, this 14.7% baseline lean informs the eventual need for deterministic guardrails.
Emoji Distribution
Bimodal & Sparse
~505 tweets carry zero emojis; ~465 carry exactly one. Fewer than 5% contain two or more — emojis act as binary anchors, not intensity counters.
Label Correlation
Positive Bias Anchor
Positive tweets use emojis at a significantly higher rate. As affective amplifiers, these symbols frequently outweigh textual weight, justifying a hybrid logic layer to mitigate positive drift.
Coverage Gap — The Critical Finding
- 34 unique emojis in the corpus
- 16 emojis in the reference table
- 12 in the intersection (overlap)
The reference table covers only 12 of the 34 unique emojis found in the corpus. This gap — confirmed here empirically — became the direct motivation for building an expanded polarity lexicon in the feature engineering stage.
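The gap reduces to a set intersection. A sketch with small stand-in sets (the real comparison is 34 corpus emojis against 16 reference entries):

```python
corpus_emojis = {"😍", "😊", "😭", "😧", "🔥"}   # stand-in for the 34 corpus emojis
ref_emojis = {"😍", "😊", "😭", "👍"}            # stand-in for the 16 reference entries

overlap = corpus_emojis & ref_emojis
coverage = len(overlap) / len(corpus_emojis)
print(len(overlap), f"{coverage:.0%}")  # → 3 60%
```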
Design Decision
"EDA confirmed emoji-aware feature engineering is justified — but under strict constraints: simple, interpretable features only; no dense emoji representations. The subtle 14.7% positive class skew validates the use of a curated polarity lexicon paired with a deterministic logic layer to ensure reliable inference."
04
Emoji Sentiment Referencing
Before committing to any emoji feature design, the pipeline runs a controlled ablation — establishing a text-only baseline, then testing three emoji feature types incrementally. Only features that demonstrably improve on the baseline are retained.
Feature Ablation — Validation F1
Text Only (Baseline)
TF-IDF bigrams, 1,265 features
0.796
Reference
+ Emoji Presence
Binary has_emoji flag
0.775
−0.021 ✕ Rejected
+ Emoji Count
Raw emoji frequency per tweet
0.785
−0.011 ✕ Rejected
+ Emoji Polarity Counts
Separate pos / neg counts via curated lexicon
0.8235
+0.027 ✓ Accepted
from scipy.sparse import hstack

# Manually defined polarity sets (label-agnostic)
POSITIVE_EMOJIS = {"😍", "😊", "😁", "😘"}
NEGATIVE_EMOJIS = {"😭", "😧"}

# Count matches per polarity class for a single tweet
pos = sum(1 for c in text if c in POSITIVE_EMOJIS)
neg = sum(1 for c in text if c in NEGATIVE_EMOJIS)

# Stack the two counts onto the TF-IDF row vector → final feature matrix
X_combined = hstack([X_tfidf, [[pos, neg]]])
↑ 1,265 dims + 2 dims = 1,267
Key Finding
"Feature semantics matter more than feature inclusion. Undirected emoji signals — presence or raw count — introduce more noise than signal. Only when direction is resolved (positive vs. negative) do emoji features improve on the text-only baseline. This distinction drives every decision downstream."
05
Feature Engineering
With the polarity mechanism validated in Step 4, this stage asks how far it
can be pushed. Four independent signal sources are engineered, each justified
by corpus evidence, and stacked into a single production feature matrix
that features.py implements exactly.
Expanded Emoji Polarity
Lexicon expanded from 6 to 193 emojis — closing the 15.2% corpus
coverage gap identified in EDA. Produces
emoji_pos_count and
emoji_neg_count.
0.834 F1
+0.038 vs baseline
Emoticon Fallback
Western-style emoticons such as :) and :( are invisible to character-level emoji extraction. Substring matching closes this structural gap.
0.838 F1
+0.042 vs baseline
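A sketch of the substring matcher, with an assumed, abbreviated emoticon lexicon (the real feature uses a fuller list):

```python
# Assumed, abbreviated emoticon lexicon
POS_EMOTICONS = (":)", ":-)", ":D", "=)")
NEG_EMOTICONS = (":(", ":-(", ":'(", "=(")

def emoticon_counts(text: str) -> tuple[int, int]:
    # Substring matching catches multi-character emoticons that
    # per-character Unicode extraction cannot see
    pos = sum(text.count(e) for e in POS_EMOTICONS)
    neg = sum(text.count(e) for e in NEG_EMOTICONS)
    return pos, neg

print(emoticon_counts("great job :) :)"))  # → (2, 0)
```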
Word Lexicon
TF-IDF may underweight low-frequency words. A deterministic lexicon
injects stable priors as word_pos_count
and word_neg_count.
0.840 F1
+0.044 vs baseline
Deterministic Sarcasm Veto
Final hybrid layer. If emoji_neg_count ≥ 1 and the model prediction is Positive, the veto forces a Negative override to mitigate positive drift.
X_final = hstack([
    X_tfidf,       # 1,269 dims
    emoji_pol,     # 2 dims (×BOOST)
    emoticon_pol,  # 2 dims (×BOOST)
    word_lex,      # 2 dims
])
# Sarcasm veto applied after prediction: a deterministic hybrid logic
# layer, not a learned feature
06
Model Training & Evaluation
The production classifier is Logistic Regression — selected for interpretability and stability on a 1,000-sample dataset. Model selection was closed in NB 3.0; this stage trains on the finalised feature matrix, gates on a hard performance benchmark, and serialises the artifacts.
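The training step reduces to a few lines of scikit-learn. A sketch on synthetic stand-in data; the real run fits the hybrid matrix from Step 5, checks F1 against the benchmark, and only then serialises:

```python
from joblib import dump
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 1,000-sample hybrid feature matrix
X, y = make_classification(n_samples=1000, n_features=50, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
f1 = f1_score(y_val, clf.predict(X_val))

BENCHMARK = 0.80  # assumed gate value
if f1 >= BENCHMARK:
    dump(clf, "sentiment_model.pkl")  # serialise only after the gate passes
```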
Validation Performance — 200 Held-Out Samples
0.8402
F1 Score ✓ Benchmark Met
82.5%
Accuracy
Negative Class
Positive Class
27 false positives vs 8 false negatives — the model is generous with positive
predictions. The sarcasm veto in predict.py
directly targets this failure mode.
Learned Feature Weights — Surprising Result
- word_neg_count — word lexicon, negative hits
- word_pos_count — word lexicon, positive hits
- emoji_pos_count — emoji polarity, positive
- emoji_neg_count — emoji polarity, negative
Architectural Implication
"The word lexicon is the single strongest feature in the model at −4.08, outweighing every TF-IDF unigram. Meanwhile, emoji coefficients are surprisingly weak (+0.15, −0.05) — the model treats word signal as more reliable than emoji signal. This is precisely why the sarcasm veto must exist as a deterministic rule rather than a learned weight: the model alone cannot resolve emoji-text conflict."
07
Inference Pipeline
The finalised system is encapsulated into a two-layer inference pipeline. Frozen artifacts guarantee that every live input undergoes the exact transformation sequence validated during training — followed by a deterministic veto layer that overrides the model when emoji-text conflict is detected.
Model Layer — Statistical Prediction
Frozen sentiment_model.pkl and
tfidf_vectorizer.pkl reconstruct
the 1,269-feature matrix at inference time. The model returns a probability
estimate and a provisional label.
Veto Layer — Deterministic Override
If emoji_neg_count > 0 and
the model predicted Positive, the veto fires — overriding to Negative.
A word guard suppresses the veto when positive text overwhelms the emoji signal.
This layer exists because the model's emoji coefficients are too weak
(+0.15, −0.05) to resolve conflict reliably on their own.
# 1. Load frozen artifacts
model = load("sentiment_model.pkl")
tfidf = load("tfidf_vectorizer.pkl")

# 2. Assemble the 1,269-feature matrix
X = hstack([tfidf.transform([text]), emoji_pol, word_lex])

# 3. Model prediction
proba = model.predict_proba(X)  # e.g. [[0.07, 0.93]]

# 4. Veto check
if emoji_neg_count > 0 and label == "Positive":
    label = "Negative"  # unless word guard fires

# 5. Return 8-key response
# → { prediction, confidence, veto_applied,
#     entropy_flag, top_drivers, timestamp, … }
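Putting the veto and the word guard together, a sketch of the override step (the guard threshold is assumed here; the real predict.py defines its own):

```python
def apply_veto(label: str, emoji_neg_count: int, word_pos_count: int,
               guard_threshold: int = 3) -> tuple[str, bool]:
    """Deterministic override sketch: the veto fires on emoji-text conflict
    unless strong positive wording suppresses it (threshold assumed)."""
    if (label == "Positive" and emoji_neg_count > 0
            and word_pos_count < guard_threshold):
        return "Negative", True   # veto fires
    return label, False           # model prediction stands

print(apply_veto("Positive", emoji_neg_count=1, word_pos_count=0))  # → ('Negative', True)
print(apply_veto("Positive", emoji_neg_count=1, word_pos_count=5))  # → ('Positive', False)
```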
Veto Verification — End-to-End Validation
"i love having bugs 😭"
Sarcasm — veto fires
conf 0.9318
"this is so beautiful and amazing 😭"
Pos-overwhelm — word guard suppresses veto
conf 0.7478
"RIP grandma 😭"
Genuine sad — veto fires correctly
conf 0.9318
"this is the worst :)"
Known limitation — emoticon sarcasm undetected
conf 0.8479
Validation Status
"End-to-end validation confirms schema integrity across all prediction
categories — 8 keys present in every response, audit trail consistent,
veto logic verified on sarcasm and genuine-sad cases. One known limitation
remains: emoticon-based sarcasm bypasses the veto since
:) is invisible to the Unicode
emoji extraction pipeline."