Problems while trying to create a new id column based on three criteria?

up vote
3
down vote

favorite

I have a dataframe with conversations and timestamps like this:

timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob...
2018-10-05 23:07:13 01 a large text blob...
2018-10-05 23:07:23 01 a large text blob...
2018-10-05 23:07:36 01 a large text blob...
2018-10-05 23:08:02 01 a large text blob...
2018-10-05 23:09:16 01 a large text blob...
2018-10-05 23:09:21 01 a large text blob...
2018-10-05 23:09:39 01 a large text blob...
2018-10-05 23:09:47 01 a large text blob...
2018-10-05 23:10:01 01 a large text blob...
2018-10-05 23:10:11 01 a large text blob...
2018-10-05 23:10:23 01 restart 
2018-10-05 23:10:59 01 a large text blob...
2018-10-05 23:11:03 01 a large text blob...
2018-10-08 23:11:32 02 a large text blob...
2018-10-08 23:12:58 02 a large text blob...
2018-10-08 23:13:16 02 a large text blob...
2018-10-08 23:14:04 02 a large text blob...
2018-10-08 03:38:36 02 a large text blob...
2018-10-08 03:38:42 02 a large text blob...
2018-10-08 03:38:52 02 a large text blob...
2018-10-08 03:38:57 02 a large text blob...
2018-10-08 03:39:10 02 a large text blob...
2018-10-08 03:39:27 02 Restart 
2018-10-08 03:40:47 02 a large text blob...
2018-10-08 03:40:54 02 a large text blob...
2018-10-08 03:41:02 02 a large text blob...
2018-10-08 03:41:12 02 a large text blob...
2018-10-08 03:41:32 02 a large text blob...
2018-10-08 03:41:39 02 a large text blob...
2018-10-08 03:42:20 02 a large text blob...
2018-10-08 03:44:58 02 a large text blob...
2018-10-08 03:45:54 02 a large text blob...
2018-10-08 03:46:06 02 a large text blob...
2018-10-08 05:06:42 03 a large text blob...
2018-10-08 05:06:53 03 a large text blob...
2018-10-08 05:08:49 03 a large text blob...
2018-10-08 05:08:58 03 a large text blob...
2018-10-08 05:58:18 04 a large text blob...
2018-10-08 05:58:26 04 a large text blob...
2018-10-08 05:58:37 04 a large text blob...
2018-10-08 05:58:58 04 a large text blob...
2018-10-08 06:00:31 04 a large text blob...
2018-10-08 06:01:00 04 a large text blob...
2018-10-08 06:01:14 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:06:03 04 a large text blob...
2018-10-08 06:10:00 04 a large text blob...
2018-10-08 09:07:03 04 a large text blob...
2018-10-08 09:09:03 04 a large text blob...
2018-10-09 10:01:00 04 a large text blob...
2018-10-09 10:02:00 04 a large text blob...
2018-10-09 10:03:00 04 a large text blob...
2018-10-09 10:09:00 04 a large text blob...
2018-10-09 10:09:00 05 a large text blob...

At the moment I would like to identify with an id the conversations inside the dataframe. The problem is that a user can have several conversations (i.e. an userID can have multiple textBlob associated). Thus, I would like to add a new_id in order to be able to identify the conversations inside the above dataframe.

For this, I would like to create a new_id column based on three criteria:

10 minutes periods

the occurrence of a keyword

when a user doesnt have more textblobs

The expected output looks like this (*):

timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob... 001
2018-10-05 23:07:13 01 a large text blob... 001
2018-10-05 23:07:23 01 a large text blob... 001
2018-10-05 23:07:36 01 a large text blob... 001
2018-10-05 23:08:02 01 a large text blob... 001
2018-10-05 23:09:16 01 a large text blob... 001
2018-10-05 23:09:21 01 a large text blob... 001
2018-10-05 23:09:39 01 a large text blob... 001
2018-10-05 23:09:47 01 a large text blob... 001
2018-10-05 23:10:01 01 a large text blob... 001
2018-10-05 23:10:11 01 a large text blob... 001
2018-10-05 23:10:23 01 restart 001 ---- (The word restart appeared so a new id is created ↓)
2018-10-05 23:10:59 01 a large text blob... 002
2018-10-05 23:11:03 01 a large text blob... 002
2018-10-08 23:11:32 02 a large text blob... 002
2018-10-08 23:12:58 02 a large text blob... 002
2018-10-08 23:13:16 02 a large text blob... 002
2018-10-08 23:14:04 02 a large text blob... 002 --- (The conversation ends because the 10 minutes time threshold was exceeded)
2018-10-08 03:38:36 02 a large text blob... 003
2018-10-08 03:38:42 02 a large text blob... 003
2018-10-08 03:38:52 02 a large text blob... 003
2018-10-08 03:38:57 02 a large text blob... 003
2018-10-08 03:39:10 02 a large text blob... 003
2018-10-08 03:39:27 02 Restart 003 ---- (The word restart appeared so a new id is created ↓)
2018-10-08 03:40:47 02 a large text blob... 004
2018-10-08 03:40:54 02 a large text blob... 004
2018-10-08 03:41:02 02 a large text blob... 004
2018-10-08 03:41:12 02 a large text blob... 004
2018-10-08 03:41:32 02 a large text blob... 004
2018-10-08 03:41:39 02 a large text blob... 004
2018-10-08 03:42:20 02 a large text blob... 004
2018-10-08 03:44:58 02 a large text blob... 004
2018-10-08 03:45:54 02 a large text blob... 004
2018-10-08 03:46:06 02 a large text blob... 004 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 05:06:42 03 a large text blob... 005
2018-10-08 05:06:53 03 a large text blob... 005
2018-10-08 05:08:49 03 a large text blob... 005
2018-10-08 05:08:58 03 a large text blob... 005 ---- (no more conversations from user id 03, thus the a new id is assigned)
2018-10-08 05:58:18 04 a large text blob... 006
2018-10-08 05:58:26 04 a large text blob... 006
2018-10-08 05:58:37 04 a large text blob... 006
2018-10-08 05:58:58 04 a large text blob... 006
2018-10-08 06:00:31 04 a large text blob... 006
2018-10-08 06:01:00 04 a large text blob... 006
2018-10-08 06:01:14 04 a large text blob... 006
2018-10-08 06:02:03 04 a large text blob... 006 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 06:02:03 04 a large text blob... 007
2018-10-08 06:06:03 04 a large text blob... 007
2018-10-08 06:10:00 04 a large text blob... 007
2018-10-08 09:07:03 04 a large text blob... 007
2018-10-08 09:09:03 04 a large text blob... 007 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-09 10:01:00 04 a large text blob... 008
2018-10-09 10:02:00 04 a large text blob... 008
2018-10-09 10:03:00 04 a large text blob... 008
2018-10-09 10:09:00 04 a large text blob... 008 ---- (no more conversations from user id 04, thus the a new id is assigned)
2018-10-09 10:09:00 05 a large text blob... 010

So far I tried to:

searchfor = ['restart','Restart']
df['keyword_id'] = df['textBlob'].str.contains('|'.join(searchfor))

And

dif = df['timestamp'] - df['timestamp'].shift()
periods = dif > pd.Timedelta('10 min')
times = periods.cumsum().apply(lambda x: x+1)
df['time_id'] = times

However, I also need to consider the userID and I end up with several columns. Is there any way of fulfilling the three conditions and getting the expected output (*)?

edited Nov 11 at 16:51

asked Nov 11 at 11:30

tumbleweed

91682353

1

Should there be a new id assigned between the lines 2018-10-05 23:11:03 and 2018-10-08 03:11:32, since userID changes from 01 to 02? Also, why does the new ID jump from 005 to 007?
– Peter Leimbigler
Nov 11 at 16:42

1

Thanks for the help @PeterLeimbigler, no I made a mistake while generating the data.. I fixed it check again
– tumbleweed
Nov 11 at 16:48

thanks. There is now a jump from 008 to 010, and the lines I mentioned above still don't have a userID increment.
– Peter Leimbigler
Nov 11 at 17:23

add a comment |

up vote
3
down vote

favorite

I have a dataframe with conversations and timestamps like this:

timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob...
2018-10-05 23:07:13 01 a large text blob...
2018-10-05 23:07:23 01 a large text blob...
2018-10-05 23:07:36 01 a large text blob...
2018-10-05 23:08:02 01 a large text blob...
2018-10-05 23:09:16 01 a large text blob...
2018-10-05 23:09:21 01 a large text blob...
2018-10-05 23:09:39 01 a large text blob...
2018-10-05 23:09:47 01 a large text blob...
2018-10-05 23:10:01 01 a large text blob...
2018-10-05 23:10:11 01 a large text blob...
2018-10-05 23:10:23 01 restart 
2018-10-05 23:10:59 01 a large text blob...
2018-10-05 23:11:03 01 a large text blob...
2018-10-08 23:11:32 02 a large text blob...
2018-10-08 23:12:58 02 a large text blob...
2018-10-08 23:13:16 02 a large text blob...
2018-10-08 23:14:04 02 a large text blob...
2018-10-08 03:38:36 02 a large text blob...
2018-10-08 03:38:42 02 a large text blob...
2018-10-08 03:38:52 02 a large text blob...
2018-10-08 03:38:57 02 a large text blob...
2018-10-08 03:39:10 02 a large text blob...
2018-10-08 03:39:27 02 Restart 
2018-10-08 03:40:47 02 a large text blob...
2018-10-08 03:40:54 02 a large text blob...
2018-10-08 03:41:02 02 a large text blob...
2018-10-08 03:41:12 02 a large text blob...
2018-10-08 03:41:32 02 a large text blob...
2018-10-08 03:41:39 02 a large text blob...
2018-10-08 03:42:20 02 a large text blob...
2018-10-08 03:44:58 02 a large text blob...
2018-10-08 03:45:54 02 a large text blob...
2018-10-08 03:46:06 02 a large text blob...
2018-10-08 05:06:42 03 a large text blob...
2018-10-08 05:06:53 03 a large text blob...
2018-10-08 05:08:49 03 a large text blob...
2018-10-08 05:08:58 03 a large text blob...
2018-10-08 05:58:18 04 a large text blob...
2018-10-08 05:58:26 04 a large text blob...
2018-10-08 05:58:37 04 a large text blob...
2018-10-08 05:58:58 04 a large text blob...
2018-10-08 06:00:31 04 a large text blob...
2018-10-08 06:01:00 04 a large text blob...
2018-10-08 06:01:14 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:06:03 04 a large text blob...
2018-10-08 06:10:00 04 a large text blob...
2018-10-08 09:07:03 04 a large text blob...
2018-10-08 09:09:03 04 a large text blob...
2018-10-09 10:01:00 04 a large text blob...
2018-10-09 10:02:00 04 a large text blob...
2018-10-09 10:03:00 04 a large text blob...
2018-10-09 10:09:00 04 a large text blob...
2018-10-09 10:09:00 05 a large text blob...

For this, I would like to create a new_id column based on three criteria:

10 minutes periods

the occurrence of a keyword

when a user doesnt have more textblobs

The expected output looks like this (*):

timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob... 001
2018-10-05 23:07:13 01 a large text blob... 001
2018-10-05 23:07:23 01 a large text blob... 001
2018-10-05 23:07:36 01 a large text blob... 001
2018-10-05 23:08:02 01 a large text blob... 001
2018-10-05 23:09:16 01 a large text blob... 001
2018-10-05 23:09:21 01 a large text blob... 001
2018-10-05 23:09:39 01 a large text blob... 001
2018-10-05 23:09:47 01 a large text blob... 001
2018-10-05 23:10:01 01 a large text blob... 001
2018-10-05 23:10:11 01 a large text blob... 001
2018-10-05 23:10:23 01 restart 001 ---- (The word restart appeared so a new id is created ↓)
2018-10-05 23:10:59 01 a large text blob... 002
2018-10-05 23:11:03 01 a large text blob... 002
2018-10-08 23:11:32 02 a large text blob... 002
2018-10-08 23:12:58 02 a large text blob... 002
2018-10-08 23:13:16 02 a large text blob... 002
2018-10-08 23:14:04 02 a large text blob... 002 --- (The conversation ends because the 10 minutes time threshold was exceeded)
2018-10-08 03:38:36 02 a large text blob... 003
2018-10-08 03:38:42 02 a large text blob... 003
2018-10-08 03:38:52 02 a large text blob... 003
2018-10-08 03:38:57 02 a large text blob... 003
2018-10-08 03:39:10 02 a large text blob... 003
2018-10-08 03:39:27 02 Restart 003 ---- (The word restart appeared so a new id is created ↓)
2018-10-08 03:40:47 02 a large text blob... 004
2018-10-08 03:40:54 02 a large text blob... 004
2018-10-08 03:41:02 02 a large text blob... 004
2018-10-08 03:41:12 02 a large text blob... 004
2018-10-08 03:41:32 02 a large text blob... 004
2018-10-08 03:41:39 02 a large text blob... 004
2018-10-08 03:42:20 02 a large text blob... 004
2018-10-08 03:44:58 02 a large text blob... 004
2018-10-08 03:45:54 02 a large text blob... 004
2018-10-08 03:46:06 02 a large text blob... 004 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 05:06:42 03 a large text blob... 005
2018-10-08 05:06:53 03 a large text blob... 005
2018-10-08 05:08:49 03 a large text blob... 005
2018-10-08 05:08:58 03 a large text blob... 005 ---- (no more conversations from user id 03, thus the a new id is assigned)
2018-10-08 05:58:18 04 a large text blob... 006
2018-10-08 05:58:26 04 a large text blob... 006
2018-10-08 05:58:37 04 a large text blob... 006
2018-10-08 05:58:58 04 a large text blob... 006
2018-10-08 06:00:31 04 a large text blob... 006
2018-10-08 06:01:00 04 a large text blob... 006
2018-10-08 06:01:14 04 a large text blob... 006
2018-10-08 06:02:03 04 a large text blob... 006 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 06:02:03 04 a large text blob... 007
2018-10-08 06:06:03 04 a large text blob... 007
2018-10-08 06:10:00 04 a large text blob... 007
2018-10-08 09:07:03 04 a large text blob... 007
2018-10-08 09:09:03 04 a large text blob... 007 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-09 10:01:00 04 a large text blob... 008
2018-10-09 10:02:00 04 a large text blob... 008
2018-10-09 10:03:00 04 a large text blob... 008
2018-10-09 10:09:00 04 a large text blob... 008 ---- (no more conversations from user id 04, thus the a new id is assigned)
2018-10-09 10:09:00 05 a large text blob... 010

So far I tried to:

searchfor = ['restart','Restart']
df['keyword_id'] = df['textBlob'].str.contains('|'.join(searchfor))

And

dif = df['timestamp'] - df['timestamp'].shift()
periods = dif > pd.Timedelta('10 min')
times = periods.cumsum().apply(lambda x: x+1)
df['time_id'] = times

However, I also need to consider the userID and I end up with several columns. Is there any way of fulfilling the three conditions and getting the expected output (*)?

edited Nov 11 at 16:51

asked Nov 11 at 11:30

tumbleweed

91682353

1

Should there be a new id assigned between the lines 2018-10-05 23:11:03 and 2018-10-08 03:11:32, since userID changes from 01 to 02? Also, why does the new ID jump from 005 to 007?
– Peter Leimbigler
Nov 11 at 16:42

1

Thanks for the help @PeterLeimbigler, no I made a mistake while generating the data.. I fixed it check again
– tumbleweed
Nov 11 at 16:48

thanks. There is now a jump from 008 to 010, and the lines I mentioned above still don't have a userID increment.
– Peter Leimbigler
Nov 11 at 17:23

add a comment |

up vote
3
down vote

favorite

I have a dataframe with conversations and timestamps like this:

timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob...
2018-10-05 23:07:13 01 a large text blob...
2018-10-05 23:07:23 01 a large text blob...
2018-10-05 23:07:36 01 a large text blob...
2018-10-05 23:08:02 01 a large text blob...
2018-10-05 23:09:16 01 a large text blob...
2018-10-05 23:09:21 01 a large text blob...
2018-10-05 23:09:39 01 a large text blob...
2018-10-05 23:09:47 01 a large text blob...
2018-10-05 23:10:01 01 a large text blob...
2018-10-05 23:10:11 01 a large text blob...
2018-10-05 23:10:23 01 restart 
2018-10-05 23:10:59 01 a large text blob...
2018-10-05 23:11:03 01 a large text blob...
2018-10-08 23:11:32 02 a large text blob...
2018-10-08 23:12:58 02 a large text blob...
2018-10-08 23:13:16 02 a large text blob...
2018-10-08 23:14:04 02 a large text blob...
2018-10-08 03:38:36 02 a large text blob...
2018-10-08 03:38:42 02 a large text blob...
2018-10-08 03:38:52 02 a large text blob...
2018-10-08 03:38:57 02 a large text blob...
2018-10-08 03:39:10 02 a large text blob...
2018-10-08 03:39:27 02 Restart 
2018-10-08 03:40:47 02 a large text blob...
2018-10-08 03:40:54 02 a large text blob...
2018-10-08 03:41:02 02 a large text blob...
2018-10-08 03:41:12 02 a large text blob...
2018-10-08 03:41:32 02 a large text blob...
2018-10-08 03:41:39 02 a large text blob...
2018-10-08 03:42:20 02 a large text blob...
2018-10-08 03:44:58 02 a large text blob...
2018-10-08 03:45:54 02 a large text blob...
2018-10-08 03:46:06 02 a large text blob...
2018-10-08 05:06:42 03 a large text blob...
2018-10-08 05:06:53 03 a large text blob...
2018-10-08 05:08:49 03 a large text blob...
2018-10-08 05:08:58 03 a large text blob...
2018-10-08 05:58:18 04 a large text blob...
2018-10-08 05:58:26 04 a large text blob...
2018-10-08 05:58:37 04 a large text blob...
2018-10-08 05:58:58 04 a large text blob...
2018-10-08 06:00:31 04 a large text blob...
2018-10-08 06:01:00 04 a large text blob...
2018-10-08 06:01:14 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:06:03 04 a large text blob...
2018-10-08 06:10:00 04 a large text blob...
2018-10-08 09:07:03 04 a large text blob...
2018-10-08 09:09:03 04 a large text blob...
2018-10-09 10:01:00 04 a large text blob...
2018-10-09 10:02:00 04 a large text blob...
2018-10-09 10:03:00 04 a large text blob...
2018-10-09 10:09:00 04 a large text blob...
2018-10-09 10:09:00 05 a large text blob...

For this, I would like to create a new_id column based on three criteria:

10 minutes periods

the occurrence of a keyword

when a user doesnt have more textblobs

The expected output looks like this (*):

timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob... 001
2018-10-05 23:07:13 01 a large text blob... 001
2018-10-05 23:07:23 01 a large text blob... 001
2018-10-05 23:07:36 01 a large text blob... 001
2018-10-05 23:08:02 01 a large text blob... 001
2018-10-05 23:09:16 01 a large text blob... 001
2018-10-05 23:09:21 01 a large text blob... 001
2018-10-05 23:09:39 01 a large text blob... 001
2018-10-05 23:09:47 01 a large text blob... 001
2018-10-05 23:10:01 01 a large text blob... 001
2018-10-05 23:10:11 01 a large text blob... 001
2018-10-05 23:10:23 01 restart 001 ---- (The word restart appeared so a new id is created ↓)
2018-10-05 23:10:59 01 a large text blob... 002
2018-10-05 23:11:03 01 a large text blob... 002
2018-10-08 23:11:32 02 a large text blob... 002
2018-10-08 23:12:58 02 a large text blob... 002
2018-10-08 23:13:16 02 a large text blob... 002
2018-10-08 23:14:04 02 a large text blob... 002 --- (The conversation ends because the 10 minutes time threshold was exceeded)
2018-10-08 03:38:36 02 a large text blob... 003
2018-10-08 03:38:42 02 a large text blob... 003
2018-10-08 03:38:52 02 a large text blob... 003
2018-10-08 03:38:57 02 a large text blob... 003
2018-10-08 03:39:10 02 a large text blob... 003
2018-10-08 03:39:27 02 Restart 003 ---- (The word restart appeared so a new id is created ↓)
2018-10-08 03:40:47 02 a large text blob... 004
2018-10-08 03:40:54 02 a large text blob... 004
2018-10-08 03:41:02 02 a large text blob... 004
2018-10-08 03:41:12 02 a large text blob... 004
2018-10-08 03:41:32 02 a large text blob... 004
2018-10-08 03:41:39 02 a large text blob... 004
2018-10-08 03:42:20 02 a large text blob... 004
2018-10-08 03:44:58 02 a large text blob... 004
2018-10-08 03:45:54 02 a large text blob... 004
2018-10-08 03:46:06 02 a large text blob... 004 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 05:06:42 03 a large text blob... 005
2018-10-08 05:06:53 03 a large text blob... 005
2018-10-08 05:08:49 03 a large text blob... 005
2018-10-08 05:08:58 03 a large text blob... 005 ---- (no more conversations from user id 03, thus the a new id is assigned)
2018-10-08 05:58:18 04 a large text blob... 006
2018-10-08 05:58:26 04 a large text blob... 006
2018-10-08 05:58:37 04 a large text blob... 006
2018-10-08 05:58:58 04 a large text blob... 006
2018-10-08 06:00:31 04 a large text blob... 006
2018-10-08 06:01:00 04 a large text blob... 006
2018-10-08 06:01:14 04 a large text blob... 006
2018-10-08 06:02:03 04 a large text blob... 006 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 06:02:03 04 a large text blob... 007
2018-10-08 06:06:03 04 a large text blob... 007
2018-10-08 06:10:00 04 a large text blob... 007
2018-10-08 09:07:03 04 a large text blob... 007
2018-10-08 09:09:03 04 a large text blob... 007 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-09 10:01:00 04 a large text blob... 008
2018-10-09 10:02:00 04 a large text blob... 008
2018-10-09 10:03:00 04 a large text blob... 008
2018-10-09 10:09:00 04 a large text blob... 008 ---- (no more conversations from user id 04, thus the a new id is assigned)
2018-10-09 10:09:00 05 a large text blob... 010

So far I tried to:

searchfor = ['restart','Restart']
df['keyword_id'] = df['textBlob'].str.contains('|'.join(searchfor))

And

dif = df['timestamp'] - df['timestamp'].shift()
periods = dif > pd.Timedelta('10 min')
times = periods.cumsum().apply(lambda x: x+1)
df['time_id'] = times

However, I also need to consider the userID and I end up with several columns. Is there any way of fulfilling the three conditions and getting the expected output (*)?

edited Nov 11 at 16:51

asked Nov 11 at 11:30

tumbleweed

91682353

I have a dataframe with conversations and timestamps like this:

timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob...
2018-10-05 23:07:13 01 a large text blob...
2018-10-05 23:07:23 01 a large text blob...
2018-10-05 23:07:36 01 a large text blob...
2018-10-05 23:08:02 01 a large text blob...
2018-10-05 23:09:16 01 a large text blob...
2018-10-05 23:09:21 01 a large text blob...
2018-10-05 23:09:39 01 a large text blob...
2018-10-05 23:09:47 01 a large text blob...
2018-10-05 23:10:01 01 a large text blob...
2018-10-05 23:10:11 01 a large text blob...
2018-10-05 23:10:23 01 restart 
2018-10-05 23:10:59 01 a large text blob...
2018-10-05 23:11:03 01 a large text blob...
2018-10-08 23:11:32 02 a large text blob...
2018-10-08 23:12:58 02 a large text blob...
2018-10-08 23:13:16 02 a large text blob...
2018-10-08 23:14:04 02 a large text blob...
2018-10-08 03:38:36 02 a large text blob...
2018-10-08 03:38:42 02 a large text blob...
2018-10-08 03:38:52 02 a large text blob...
2018-10-08 03:38:57 02 a large text blob...
2018-10-08 03:39:10 02 a large text blob...
2018-10-08 03:39:27 02 Restart 
2018-10-08 03:40:47 02 a large text blob...
2018-10-08 03:40:54 02 a large text blob...
2018-10-08 03:41:02 02 a large text blob...
2018-10-08 03:41:12 02 a large text blob...
2018-10-08 03:41:32 02 a large text blob...
2018-10-08 03:41:39 02 a large text blob...
2018-10-08 03:42:20 02 a large text blob...
2018-10-08 03:44:58 02 a large text blob...
2018-10-08 03:45:54 02 a large text blob...
2018-10-08 03:46:06 02 a large text blob...
2018-10-08 05:06:42 03 a large text blob...
2018-10-08 05:06:53 03 a large text blob...
2018-10-08 05:08:49 03 a large text blob...
2018-10-08 05:08:58 03 a large text blob...
2018-10-08 05:58:18 04 a large text blob...
2018-10-08 05:58:26 04 a large text blob...
2018-10-08 05:58:37 04 a large text blob...
2018-10-08 05:58:58 04 a large text blob...
2018-10-08 06:00:31 04 a large text blob...
2018-10-08 06:01:00 04 a large text blob...
2018-10-08 06:01:14 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:02:03 04 a large text blob...
2018-10-08 06:06:03 04 a large text blob...
2018-10-08 06:10:00 04 a large text blob...
2018-10-08 09:07:03 04 a large text blob...
2018-10-08 09:09:03 04 a large text blob...
2018-10-09 10:01:00 04 a large text blob...
2018-10-09 10:02:00 04 a large text blob...
2018-10-09 10:03:00 04 a large text blob...
2018-10-09 10:09:00 04 a large text blob...
2018-10-09 10:09:00 05 a large text blob...

For this, I would like to create a new_id column based on three criteria:

10 minutes periods

the occurrence of a keyword

when a user doesnt have more textblobs

The expected output looks like this (*):

timestamp userID textBlob new_id
2018-10-05 23:07:02 01 a large text blob... 001
2018-10-05 23:07:13 01 a large text blob... 001
2018-10-05 23:07:23 01 a large text blob... 001
2018-10-05 23:07:36 01 a large text blob... 001
2018-10-05 23:08:02 01 a large text blob... 001
2018-10-05 23:09:16 01 a large text blob... 001
2018-10-05 23:09:21 01 a large text blob... 001
2018-10-05 23:09:39 01 a large text blob... 001
2018-10-05 23:09:47 01 a large text blob... 001
2018-10-05 23:10:01 01 a large text blob... 001
2018-10-05 23:10:11 01 a large text blob... 001
2018-10-05 23:10:23 01 restart 001 ---- (The word restart appeared so a new id is created ↓)
2018-10-05 23:10:59 01 a large text blob... 002
2018-10-05 23:11:03 01 a large text blob... 002
2018-10-08 23:11:32 02 a large text blob... 002
2018-10-08 23:12:58 02 a large text blob... 002
2018-10-08 23:13:16 02 a large text blob... 002
2018-10-08 23:14:04 02 a large text blob... 002 --- (The conversation ends because the 10 minutes time threshold was exceeded)
2018-10-08 03:38:36 02 a large text blob... 003
2018-10-08 03:38:42 02 a large text blob... 003
2018-10-08 03:38:52 02 a large text blob... 003
2018-10-08 03:38:57 02 a large text blob... 003
2018-10-08 03:39:10 02 a large text blob... 003
2018-10-08 03:39:27 02 Restart 003 ---- (The word restart appeared so a new id is created ↓)
2018-10-08 03:40:47 02 a large text blob... 004
2018-10-08 03:40:54 02 a large text blob... 004
2018-10-08 03:41:02 02 a large text blob... 004
2018-10-08 03:41:12 02 a large text blob... 004
2018-10-08 03:41:32 02 a large text blob... 004
2018-10-08 03:41:39 02 a large text blob... 004
2018-10-08 03:42:20 02 a large text blob... 004
2018-10-08 03:44:58 02 a large text blob... 004
2018-10-08 03:45:54 02 a large text blob... 004
2018-10-08 03:46:06 02 a large text blob... 004 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 05:06:42 03 a large text blob... 005
2018-10-08 05:06:53 03 a large text blob... 005
2018-10-08 05:08:49 03 a large text blob... 005
2018-10-08 05:08:58 03 a large text blob... 005 ---- (no more conversations from user id 03, thus the a new id is assigned)
2018-10-08 05:58:18 04 a large text blob... 006
2018-10-08 05:58:26 04 a large text blob... 006
2018-10-08 05:58:37 04 a large text blob... 006
2018-10-08 05:58:58 04 a large text blob... 006
2018-10-08 06:00:31 04 a large text blob... 006
2018-10-08 06:01:00 04 a large text blob... 006
2018-10-08 06:01:14 04 a large text blob... 006
2018-10-08 06:02:03 04 a large text blob... 006 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-08 06:02:03 04 a large text blob... 007
2018-10-08 06:06:03 04 a large text blob... 007
2018-10-08 06:10:00 04 a large text blob... 007
2018-10-08 09:07:03 04 a large text blob... 007
2018-10-08 09:09:03 04 a large text blob... 007 ---- (The 10 minutes threshold is exceeded a new id is assigned ↓)
2018-10-09 10:01:00 04 a large text blob... 008
2018-10-09 10:02:00 04 a large text blob... 008
2018-10-09 10:03:00 04 a large text blob... 008
2018-10-09 10:09:00 04 a large text blob... 008 ---- (no more conversations from user id 04, thus the a new id is assigned)
2018-10-09 10:09:00 05 a large text blob... 010

So far I tried to:

searchfor = ['restart','Restart']
df['keyword_id'] = df['textBlob'].str.contains('|'.join(searchfor))

And

dif = df['timestamp'] - df['timestamp'].shift()
periods = dif > pd.Timedelta('10 min')
times = periods.cumsum().apply(lambda x: x+1)
df['time_id'] = times

However, I also need to consider the userID and I end up with several columns. Is there any way of fulfilling the three conditions and getting the expected output (*)?

python pandas dataframe

edited Nov 11 at 16:51

asked Nov 11 at 11:30

tumbleweed

91682353

edited Nov 11 at 16:51

asked Nov 11 at 11:30

tumbleweed

91682353

edited Nov 11 at 16:51

asked Nov 11 at 11:30

tumbleweed

91682353

asked Nov 11 at 11:30

tumbleweed

91682353

asked Nov 11 at 11:30

tumbleweed

91682353

1

Should there be a new id assigned between the lines 2018-10-05 23:11:03 and 2018-10-08 03:11:32, since userID changes from 01 to 02? Also, why does the new ID jump from 005 to 007?
– Peter Leimbigler
Nov 11 at 16:42

1

Thanks for the help @PeterLeimbigler, no I made a mistake while generating the data.. I fixed it check again
– tumbleweed
Nov 11 at 16:48

thanks. There is now a jump from 008 to 010, and the lines I mentioned above still don't have a userID increment.
– Peter Leimbigler
Nov 11 at 17:23

add a comment |

1

Should there be a new id assigned between the lines 2018-10-05 23:11:03 and 2018-10-08 03:11:32, since userID changes from 01 to 02? Also, why does the new ID jump from 005 to 007?
– Peter Leimbigler
Nov 11 at 16:42

1

Thanks for the help @PeterLeimbigler, no I made a mistake while generating the data.. I fixed it check again
– tumbleweed
Nov 11 at 16:48

thanks. There is now a jump from 008 to 010, and the lines I mentioned above still don't have a userID increment.
– Peter Leimbigler
Nov 11 at 17:23

Should there be a new id assigned between the lines 2018-10-05 23:11:03 and 2018-10-08 03:11:32, since userID changes from 01 to 02? Also, why does the new ID jump from 005 to 007?
– Peter Leimbigler
Nov 11 at 16:42

Thanks for the help @PeterLeimbigler, no I made a mistake while generating the data.. I fixed it check again
– tumbleweed
Nov 11 at 16:48

thanks. There is now a jump from 008 to 010, and the lines I mentioned above still don't have a userID increment.
– Peter Leimbigler
Nov 11 at 17:23

add a comment |

2 Answers
2

active

oldest

votes

up vote
1
down vote

accepted

You're most of the way there. To put it all together, build a boolean mask for each condition, then convert the masks to int and take their cumulative sum:

mask1 = df.timestamp.diff() > pd.Timedelta(10, 'm') 
mask2 = df['userID'].diff() != 0
mask3 = df['textBlob'].shift().str.lower() == 'restart'

df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum()

# Result:
print(df.to_string(index=False))

timestamp userID textBlob new_id
2018-10-05 23:07:02 1 a_large_text_blob... 1
2018-10-05 23:07:13 1 a_large_text_blob... 1
2018-10-05 23:07:23 1 a_large_text_blob... 1
2018-10-05 23:07:36 1 a_large_text_blob... 1
2018-10-05 23:08:02 1 a_large_text_blob... 1
2018-10-05 23:09:16 1 a_large_text_blob... 1
2018-10-05 23:09:21 1 a_large_text_blob... 1
2018-10-05 23:09:39 1 a_large_text_blob... 1
2018-10-05 23:09:47 1 a_large_text_blob... 1
2018-10-05 23:10:01 1 a_large_text_blob... 1
2018-10-05 23:10:11 1 a_large_text_blob... 1
2018-10-05 23:10:23 1 restart 1
2018-10-05 23:10:59 1 a_large_text_blob... 2
2018-10-05 23:11:03 1 a_large_text_blob... 2
2018-10-08 03:11:32 2 a_large_text_blob... 3
2018-10-08 03:12:58 2 a_large_text_blob... 3
2018-10-08 03:13:16 2 a_large_text_blob... 3
2018-10-08 03:14:04 2 a_large_text_blob... 3
2018-10-08 03:38:36 2 a_large_text_blob... 4
2018-10-08 03:38:42 2 a_large_text_blob... 4
2018-10-08 03:38:52 2 a_large_text_blob... 4
2018-10-08 03:38:57 2 a_large_text_blob... 4
2018-10-08 03:39:10 2 a_large_text_blob... 4
2018-10-08 03:39:27 2 Restart 4
2018-10-08 03:40:47 2 a_large_text_blob... 5
2018-10-08 03:40:54 2 a_large_text_blob... 5
2018-10-08 03:41:02 2 a_large_text_blob... 5
2018-10-08 03:41:12 2 a_large_text_blob... 5
2018-10-08 03:41:32 2 a_large_text_blob... 5
2018-10-08 03:41:39 2 a_large_text_blob... 5
2018-10-08 03:42:20 2 a_large_text_blob... 5
2018-10-08 03:44:58 2 a_large_text_blob... 5
2018-10-08 03:45:54 2 a_large_text_blob... 5
2018-10-08 03:46:06 2 a_large_text_blob... 5
2018-10-08 05:06:42 3 a_large_text_blob... 6
2018-10-08 05:06:53 3 a_large_text_blob... 6
2018-10-08 05:08:49 3 a_large_text_blob... 6
2018-10-08 05:08:58 3 a_large_text_blob... 6
2018-10-08 05:58:18 4 a_large_text_blob... 7
2018-10-08 05:58:26 4 a_large_text_blob... 7
2018-10-08 05:58:37 4 a_large_text_blob... 7
2018-10-08 05:58:58 4 a_large_text_blob... 7
2018-10-08 06:00:31 4 a_large_text_blob... 7
2018-10-08 06:01:00 4 a_large_text_blob... 7
2018-10-08 06:01:14 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:06:03 4 a_large_text_blob... 7
2018-10-08 06:10:00 4 a_large_text_blob... 7
2018-10-08 09:07:03 4 a_large_text_blob... 8
2018-10-08 09:09:03 4 a_large_text_blob... 8
2018-10-09 10:01:00 4 a_large_text_blob... 9
2018-10-09 10:02:00 4 a_large_text_blob... 9
2018-10-09 10:03:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 5 a_large_text_blob... 10

edited Nov 11 at 17:24

answered Nov 11 at 16:51

Peter Leimbigler

3,6941415

I updated the data, thanks for the help
– tumbleweed
Nov 11 at 16:53

1

@tumbleweed, I've adjusted the restart logic, so this output should be what you're looking for. Let me know if I'm still missing anything.
– Peter Leimbigler
Nov 11 at 17:24

add a comment |

up vote
0
down vote

Ok I thought the 10 minutes period should count from the beginning of the conversation, not from the immediate below message, in that case you would need to iterate over the rows like:

df['timestamp'] = pd.to_datetime(df['timestamp'])
restart = df.textBlob.str.contains('|'.join(['restart','Restart']))
user_change = df.userID == df.userID.shift().fillna(method='bfill')
df['new_id'] = (restart | ~user_change).cumsum()
current_id = 0
new_id_prev = 0
start_time = df.timestamp.iloc[0]

for i, new_id, timestamp in zip(range(len(df)), df.new_id, df.timestamp):
 timedelta = timestamp - start_time

 if new_id != new_id_prev or timedelta > pd.Timedelta(10,unit='m'):
 current_id += 1
 start_time = timestamp

 new_id_prev = new_id 
 df.new_id.iloc[i] = current_id

answered Nov 11 at 17:04

Franco Piccolo

1,325611

1

Thanks for the help!
– tumbleweed
Nov 11 at 17:54

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53248280%2fproblems-while-trying-to-create-a-new-id-column-based-on-three-criteria%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

up vote
1
down vote

accepted

You're most of the way there. To put it all together, build a boolean mask for each condition, then convert the masks to int and take their cumulative sum:

mask1 = df.timestamp.diff() > pd.Timedelta(10, 'm') 
mask2 = df['userID'].diff() != 0
mask3 = df['textBlob'].shift().str.lower() == 'restart'

df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum()

# Result:
print(df.to_string(index=False))

timestamp userID textBlob new_id
2018-10-05 23:07:02 1 a_large_text_blob... 1
2018-10-05 23:07:13 1 a_large_text_blob... 1
2018-10-05 23:07:23 1 a_large_text_blob... 1
2018-10-05 23:07:36 1 a_large_text_blob... 1
2018-10-05 23:08:02 1 a_large_text_blob... 1
2018-10-05 23:09:16 1 a_large_text_blob... 1
2018-10-05 23:09:21 1 a_large_text_blob... 1
2018-10-05 23:09:39 1 a_large_text_blob... 1
2018-10-05 23:09:47 1 a_large_text_blob... 1
2018-10-05 23:10:01 1 a_large_text_blob... 1
2018-10-05 23:10:11 1 a_large_text_blob... 1
2018-10-05 23:10:23 1 restart 1
2018-10-05 23:10:59 1 a_large_text_blob... 2
2018-10-05 23:11:03 1 a_large_text_blob... 2
2018-10-08 03:11:32 2 a_large_text_blob... 3
2018-10-08 03:12:58 2 a_large_text_blob... 3
2018-10-08 03:13:16 2 a_large_text_blob... 3
2018-10-08 03:14:04 2 a_large_text_blob... 3
2018-10-08 03:38:36 2 a_large_text_blob... 4
2018-10-08 03:38:42 2 a_large_text_blob... 4
2018-10-08 03:38:52 2 a_large_text_blob... 4
2018-10-08 03:38:57 2 a_large_text_blob... 4
2018-10-08 03:39:10 2 a_large_text_blob... 4
2018-10-08 03:39:27 2 Restart 4
2018-10-08 03:40:47 2 a_large_text_blob... 5
2018-10-08 03:40:54 2 a_large_text_blob... 5
2018-10-08 03:41:02 2 a_large_text_blob... 5
2018-10-08 03:41:12 2 a_large_text_blob... 5
2018-10-08 03:41:32 2 a_large_text_blob... 5
2018-10-08 03:41:39 2 a_large_text_blob... 5
2018-10-08 03:42:20 2 a_large_text_blob... 5
2018-10-08 03:44:58 2 a_large_text_blob... 5
2018-10-08 03:45:54 2 a_large_text_blob... 5
2018-10-08 03:46:06 2 a_large_text_blob... 5
2018-10-08 05:06:42 3 a_large_text_blob... 6
2018-10-08 05:06:53 3 a_large_text_blob... 6
2018-10-08 05:08:49 3 a_large_text_blob... 6
2018-10-08 05:08:58 3 a_large_text_blob... 6
2018-10-08 05:58:18 4 a_large_text_blob... 7
2018-10-08 05:58:26 4 a_large_text_blob... 7
2018-10-08 05:58:37 4 a_large_text_blob... 7
2018-10-08 05:58:58 4 a_large_text_blob... 7
2018-10-08 06:00:31 4 a_large_text_blob... 7
2018-10-08 06:01:00 4 a_large_text_blob... 7
2018-10-08 06:01:14 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:06:03 4 a_large_text_blob... 7
2018-10-08 06:10:00 4 a_large_text_blob... 7
2018-10-08 09:07:03 4 a_large_text_blob... 8
2018-10-08 09:09:03 4 a_large_text_blob... 8
2018-10-09 10:01:00 4 a_large_text_blob... 9
2018-10-09 10:02:00 4 a_large_text_blob... 9
2018-10-09 10:03:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 5 a_large_text_blob... 10

edited Nov 11 at 17:24

answered Nov 11 at 16:51

Peter Leimbigler

3,6941415

I updated the data, thanks for the help
– tumbleweed
Nov 11 at 16:53

1

@tumbleweed, I've adjusted the restart logic, so this output should be what you're looking for. Let me know if I'm still missing anything.
– Peter Leimbigler
Nov 11 at 17:24

add a comment |

up vote
1
down vote

accepted

You're most of the way there. To put it all together, build a boolean mask for each condition, then convert the masks to int and take their cumulative sum:

mask1 = df.timestamp.diff() > pd.Timedelta(10, 'm') 
mask2 = df['userID'].diff() != 0
mask3 = df['textBlob'].shift().str.lower() == 'restart'

df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum()

# Result:
print(df.to_string(index=False))

timestamp userID textBlob new_id
2018-10-05 23:07:02 1 a_large_text_blob... 1
2018-10-05 23:07:13 1 a_large_text_blob... 1
2018-10-05 23:07:23 1 a_large_text_blob... 1
2018-10-05 23:07:36 1 a_large_text_blob... 1
2018-10-05 23:08:02 1 a_large_text_blob... 1
2018-10-05 23:09:16 1 a_large_text_blob... 1
2018-10-05 23:09:21 1 a_large_text_blob... 1
2018-10-05 23:09:39 1 a_large_text_blob... 1
2018-10-05 23:09:47 1 a_large_text_blob... 1
2018-10-05 23:10:01 1 a_large_text_blob... 1
2018-10-05 23:10:11 1 a_large_text_blob... 1
2018-10-05 23:10:23 1 restart 1
2018-10-05 23:10:59 1 a_large_text_blob... 2
2018-10-05 23:11:03 1 a_large_text_blob... 2
2018-10-08 03:11:32 2 a_large_text_blob... 3
2018-10-08 03:12:58 2 a_large_text_blob... 3
2018-10-08 03:13:16 2 a_large_text_blob... 3
2018-10-08 03:14:04 2 a_large_text_blob... 3
2018-10-08 03:38:36 2 a_large_text_blob... 4
2018-10-08 03:38:42 2 a_large_text_blob... 4
2018-10-08 03:38:52 2 a_large_text_blob... 4
2018-10-08 03:38:57 2 a_large_text_blob... 4
2018-10-08 03:39:10 2 a_large_text_blob... 4
2018-10-08 03:39:27 2 Restart 4
2018-10-08 03:40:47 2 a_large_text_blob... 5
2018-10-08 03:40:54 2 a_large_text_blob... 5
2018-10-08 03:41:02 2 a_large_text_blob... 5
2018-10-08 03:41:12 2 a_large_text_blob... 5
2018-10-08 03:41:32 2 a_large_text_blob... 5
2018-10-08 03:41:39 2 a_large_text_blob... 5
2018-10-08 03:42:20 2 a_large_text_blob... 5
2018-10-08 03:44:58 2 a_large_text_blob... 5
2018-10-08 03:45:54 2 a_large_text_blob... 5
2018-10-08 03:46:06 2 a_large_text_blob... 5
2018-10-08 05:06:42 3 a_large_text_blob... 6
2018-10-08 05:06:53 3 a_large_text_blob... 6
2018-10-08 05:08:49 3 a_large_text_blob... 6
2018-10-08 05:08:58 3 a_large_text_blob... 6
2018-10-08 05:58:18 4 a_large_text_blob... 7
2018-10-08 05:58:26 4 a_large_text_blob... 7
2018-10-08 05:58:37 4 a_large_text_blob... 7
2018-10-08 05:58:58 4 a_large_text_blob... 7
2018-10-08 06:00:31 4 a_large_text_blob... 7
2018-10-08 06:01:00 4 a_large_text_blob... 7
2018-10-08 06:01:14 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:06:03 4 a_large_text_blob... 7
2018-10-08 06:10:00 4 a_large_text_blob... 7
2018-10-08 09:07:03 4 a_large_text_blob... 8
2018-10-08 09:09:03 4 a_large_text_blob... 8
2018-10-09 10:01:00 4 a_large_text_blob... 9
2018-10-09 10:02:00 4 a_large_text_blob... 9
2018-10-09 10:03:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 5 a_large_text_blob... 10

edited Nov 11 at 17:24

answered Nov 11 at 16:51

Peter Leimbigler

3,6941415

I updated the data, thanks for the help
– tumbleweed
Nov 11 at 16:53

1

@tumbleweed, I've adjusted the restart logic, so this output should be what you're looking for. Let me know if I'm still missing anything.
– Peter Leimbigler
Nov 11 at 17:24

add a comment |

up vote
1
down vote

accepted

You're most of the way there. To put it all together, build a boolean mask for each condition, then convert the masks to int and take their cumulative sum:

mask1 = df.timestamp.diff() > pd.Timedelta(10, 'm') 
mask2 = df['userID'].diff() != 0
mask3 = df['textBlob'].shift().str.lower() == 'restart'

df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum()

# Result:
print(df.to_string(index=False))

timestamp userID textBlob new_id
2018-10-05 23:07:02 1 a_large_text_blob... 1
2018-10-05 23:07:13 1 a_large_text_blob... 1
2018-10-05 23:07:23 1 a_large_text_blob... 1
2018-10-05 23:07:36 1 a_large_text_blob... 1
2018-10-05 23:08:02 1 a_large_text_blob... 1
2018-10-05 23:09:16 1 a_large_text_blob... 1
2018-10-05 23:09:21 1 a_large_text_blob... 1
2018-10-05 23:09:39 1 a_large_text_blob... 1
2018-10-05 23:09:47 1 a_large_text_blob... 1
2018-10-05 23:10:01 1 a_large_text_blob... 1
2018-10-05 23:10:11 1 a_large_text_blob... 1
2018-10-05 23:10:23 1 restart 1
2018-10-05 23:10:59 1 a_large_text_blob... 2
2018-10-05 23:11:03 1 a_large_text_blob... 2
2018-10-08 03:11:32 2 a_large_text_blob... 3
2018-10-08 03:12:58 2 a_large_text_blob... 3
2018-10-08 03:13:16 2 a_large_text_blob... 3
2018-10-08 03:14:04 2 a_large_text_blob... 3
2018-10-08 03:38:36 2 a_large_text_blob... 4
2018-10-08 03:38:42 2 a_large_text_blob... 4
2018-10-08 03:38:52 2 a_large_text_blob... 4
2018-10-08 03:38:57 2 a_large_text_blob... 4
2018-10-08 03:39:10 2 a_large_text_blob... 4
2018-10-08 03:39:27 2 Restart 4
2018-10-08 03:40:47 2 a_large_text_blob... 5
2018-10-08 03:40:54 2 a_large_text_blob... 5
2018-10-08 03:41:02 2 a_large_text_blob... 5
2018-10-08 03:41:12 2 a_large_text_blob... 5
2018-10-08 03:41:32 2 a_large_text_blob... 5
2018-10-08 03:41:39 2 a_large_text_blob... 5
2018-10-08 03:42:20 2 a_large_text_blob... 5
2018-10-08 03:44:58 2 a_large_text_blob... 5
2018-10-08 03:45:54 2 a_large_text_blob... 5
2018-10-08 03:46:06 2 a_large_text_blob... 5
2018-10-08 05:06:42 3 a_large_text_blob... 6
2018-10-08 05:06:53 3 a_large_text_blob... 6
2018-10-08 05:08:49 3 a_large_text_blob... 6
2018-10-08 05:08:58 3 a_large_text_blob... 6
2018-10-08 05:58:18 4 a_large_text_blob... 7
2018-10-08 05:58:26 4 a_large_text_blob... 7
2018-10-08 05:58:37 4 a_large_text_blob... 7
2018-10-08 05:58:58 4 a_large_text_blob... 7
2018-10-08 06:00:31 4 a_large_text_blob... 7
2018-10-08 06:01:00 4 a_large_text_blob... 7
2018-10-08 06:01:14 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:06:03 4 a_large_text_blob... 7
2018-10-08 06:10:00 4 a_large_text_blob... 7
2018-10-08 09:07:03 4 a_large_text_blob... 8
2018-10-08 09:09:03 4 a_large_text_blob... 8
2018-10-09 10:01:00 4 a_large_text_blob... 9
2018-10-09 10:02:00 4 a_large_text_blob... 9
2018-10-09 10:03:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 5 a_large_text_blob... 10

edited Nov 11 at 17:24

answered Nov 11 at 16:51

Peter Leimbigler

3,6941415

You're most of the way there. To put it all together, build a boolean mask for each condition, then convert the masks to int and take their cumulative sum:

mask1 = df.timestamp.diff() > pd.Timedelta(10, 'm') 
mask2 = df['userID'].diff() != 0
mask3 = df['textBlob'].shift().str.lower() == 'restart'

df['new_id'] = (mask1 | mask2 | mask3).astype(int).cumsum()

# Result:
print(df.to_string(index=False))

timestamp userID textBlob new_id
2018-10-05 23:07:02 1 a_large_text_blob... 1
2018-10-05 23:07:13 1 a_large_text_blob... 1
2018-10-05 23:07:23 1 a_large_text_blob... 1
2018-10-05 23:07:36 1 a_large_text_blob... 1
2018-10-05 23:08:02 1 a_large_text_blob... 1
2018-10-05 23:09:16 1 a_large_text_blob... 1
2018-10-05 23:09:21 1 a_large_text_blob... 1
2018-10-05 23:09:39 1 a_large_text_blob... 1
2018-10-05 23:09:47 1 a_large_text_blob... 1
2018-10-05 23:10:01 1 a_large_text_blob... 1
2018-10-05 23:10:11 1 a_large_text_blob... 1
2018-10-05 23:10:23 1 restart 1
2018-10-05 23:10:59 1 a_large_text_blob... 2
2018-10-05 23:11:03 1 a_large_text_blob... 2
2018-10-08 03:11:32 2 a_large_text_blob... 3
2018-10-08 03:12:58 2 a_large_text_blob... 3
2018-10-08 03:13:16 2 a_large_text_blob... 3
2018-10-08 03:14:04 2 a_large_text_blob... 3
2018-10-08 03:38:36 2 a_large_text_blob... 4
2018-10-08 03:38:42 2 a_large_text_blob... 4
2018-10-08 03:38:52 2 a_large_text_blob... 4
2018-10-08 03:38:57 2 a_large_text_blob... 4
2018-10-08 03:39:10 2 a_large_text_blob... 4
2018-10-08 03:39:27 2 Restart 4
2018-10-08 03:40:47 2 a_large_text_blob... 5
2018-10-08 03:40:54 2 a_large_text_blob... 5
2018-10-08 03:41:02 2 a_large_text_blob... 5
2018-10-08 03:41:12 2 a_large_text_blob... 5
2018-10-08 03:41:32 2 a_large_text_blob... 5
2018-10-08 03:41:39 2 a_large_text_blob... 5
2018-10-08 03:42:20 2 a_large_text_blob... 5
2018-10-08 03:44:58 2 a_large_text_blob... 5
2018-10-08 03:45:54 2 a_large_text_blob... 5
2018-10-08 03:46:06 2 a_large_text_blob... 5
2018-10-08 05:06:42 3 a_large_text_blob... 6
2018-10-08 05:06:53 3 a_large_text_blob... 6
2018-10-08 05:08:49 3 a_large_text_blob... 6
2018-10-08 05:08:58 3 a_large_text_blob... 6
2018-10-08 05:58:18 4 a_large_text_blob... 7
2018-10-08 05:58:26 4 a_large_text_blob... 7
2018-10-08 05:58:37 4 a_large_text_blob... 7
2018-10-08 05:58:58 4 a_large_text_blob... 7
2018-10-08 06:00:31 4 a_large_text_blob... 7
2018-10-08 06:01:00 4 a_large_text_blob... 7
2018-10-08 06:01:14 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:02:03 4 a_large_text_blob... 7
2018-10-08 06:06:03 4 a_large_text_blob... 7
2018-10-08 06:10:00 4 a_large_text_blob... 7
2018-10-08 09:07:03 4 a_large_text_blob... 8
2018-10-08 09:09:03 4 a_large_text_blob... 8
2018-10-09 10:01:00 4 a_large_text_blob... 9
2018-10-09 10:02:00 4 a_large_text_blob... 9
2018-10-09 10:03:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 4 a_large_text_blob... 9
2018-10-09 10:09:00 5 a_large_text_blob... 10

edited Nov 11 at 17:24

answered Nov 11 at 16:51

Peter Leimbigler

3,6941415

edited Nov 11 at 17:24

answered Nov 11 at 16:51

Peter Leimbigler

3,6941415

answered Nov 11 at 16:51

Peter Leimbigler

3,6941415

answered Nov 11 at 16:51

Peter Leimbigler

3,6941415

I updated the data, thanks for the help
– tumbleweed
Nov 11 at 16:53

1

@tumbleweed, I've adjusted the restart logic, so this output should be what you're looking for. Let me know if I'm still missing anything.
– Peter Leimbigler
Nov 11 at 17:24

add a comment |

I updated the data, thanks for the help
– tumbleweed
Nov 11 at 16:53

1

@tumbleweed, I've adjusted the restart logic, so this output should be what you're looking for. Let me know if I'm still missing anything.
– Peter Leimbigler
Nov 11 at 17:24

I updated the data, thanks for the help
– tumbleweed
Nov 11 at 16:53

@tumbleweed, I've adjusted the restart logic, so this output should be what you're looking for. Let me know if I'm still missing anything.
– Peter Leimbigler
Nov 11 at 17:24

add a comment |

up vote
0
down vote

Ok I thought the 10 minutes period should count from the beginning of the conversation, not from the immediate below message, in that case you would need to iterate over the rows like:

df['timestamp'] = pd.to_datetime(df['timestamp'])
restart = df.textBlob.str.contains('|'.join(['restart','Restart']))
user_change = df.userID == df.userID.shift().fillna(method='bfill')
df['new_id'] = (restart | ~user_change).cumsum()
current_id = 0
new_id_prev = 0
start_time = df.timestamp.iloc[0]

for i, new_id, timestamp in zip(range(len(df)), df.new_id, df.timestamp):
 timedelta = timestamp - start_time

 if new_id != new_id_prev or timedelta > pd.Timedelta(10,unit='m'):
 current_id += 1
 start_time = timestamp

 new_id_prev = new_id 
 df.new_id.iloc[i] = current_id

answered Nov 11 at 17:04

Franco Piccolo

1,325611

1

Thanks for the help!
– tumbleweed
Nov 11 at 17:54

add a comment |

up vote
0
down vote

Ok I thought the 10 minutes period should count from the beginning of the conversation, not from the immediate below message, in that case you would need to iterate over the rows like:

df['timestamp'] = pd.to_datetime(df['timestamp'])
restart = df.textBlob.str.contains('|'.join(['restart','Restart']))
user_change = df.userID == df.userID.shift().fillna(method='bfill')
df['new_id'] = (restart | ~user_change).cumsum()
current_id = 0
new_id_prev = 0
start_time = df.timestamp.iloc[0]

for i, new_id, timestamp in zip(range(len(df)), df.new_id, df.timestamp):
 timedelta = timestamp - start_time

 if new_id != new_id_prev or timedelta > pd.Timedelta(10,unit='m'):
 current_id += 1
 start_time = timestamp

 new_id_prev = new_id 
 df.new_id.iloc[i] = current_id

answered Nov 11 at 17:04

Franco Piccolo

1,325611

1

Thanks for the help!
– tumbleweed
Nov 11 at 17:54

add a comment |

up vote
0
down vote

Ok I thought the 10 minutes period should count from the beginning of the conversation, not from the immediate below message, in that case you would need to iterate over the rows like:

df['timestamp'] = pd.to_datetime(df['timestamp'])
restart = df.textBlob.str.contains('|'.join(['restart','Restart']))
user_change = df.userID == df.userID.shift().fillna(method='bfill')
df['new_id'] = (restart | ~user_change).cumsum()
current_id = 0
new_id_prev = 0
start_time = df.timestamp.iloc[0]

for i, new_id, timestamp in zip(range(len(df)), df.new_id, df.timestamp):
 timedelta = timestamp - start_time

 if new_id != new_id_prev or timedelta > pd.Timedelta(10,unit='m'):
 current_id += 1
 start_time = timestamp

 new_id_prev = new_id 
 df.new_id.iloc[i] = current_id

answered Nov 11 at 17:04

Franco Piccolo

1,325611

Ok I thought the 10 minutes period should count from the beginning of the conversation, not from the immediate below message, in that case you would need to iterate over the rows like:

df['timestamp'] = pd.to_datetime(df['timestamp'])
restart = df.textBlob.str.contains('|'.join(['restart','Restart']))
user_change = df.userID == df.userID.shift().fillna(method='bfill')
df['new_id'] = (restart | ~user_change).cumsum()
current_id = 0
new_id_prev = 0
start_time = df.timestamp.iloc[0]

for i, new_id, timestamp in zip(range(len(df)), df.new_id, df.timestamp):
 timedelta = timestamp - start_time

 if new_id != new_id_prev or timedelta > pd.Timedelta(10,unit='m'):
 current_id += 1
 start_time = timestamp

 new_id_prev = new_id 
 df.new_id.iloc[i] = current_id

answered Nov 11 at 17:04

Franco Piccolo

1,325611

answered Nov 11 at 17:04

Franco Piccolo

1,325611

answered Nov 11 at 17:04

Franco Piccolo

1,325611

answered Nov 11 at 17:04

Franco Piccolo

1,325611

1

Thanks for the help!
– tumbleweed
Nov 11 at 17:54

add a comment |

1

Thanks for the help!
– tumbleweed
Nov 11 at 17:54

Thanks for the help!
– tumbleweed
Nov 11 at 17:54

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Myujth